prodigy-pdf with Azure AI Document Intelligence instead of Docling?

pablospe · May 17, 2025, 9:50pm

I was wondering if anyone has tried alternatives to Docling, such as Azure AI Document Intelligence, which provides similar output.

ines · May 18, 2025, 7:32am

Hi and welcome!

In theory, it should be no problem to integrate Azure AI and similar services and use them with prodigy-pdf – maybe someone from the community has even done it already. We definitely have quite a few Azure + Prodigy users, especially in the data-private medical/health field. (To others reading, comment if you have any pointers!)

One approach could be to implement a wrapper similar to spacy-layout – basically, taking a PDF as input and returning a spaCy Doc object. So it would take the output and add the extracted text spans to the Doc.spans (see here for the relevant part of the code). Then it should pretty much work as a drop-in replacement in prodigy-pdf (the code is all open source).
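As a rough illustration of the wrapper idea, here is a minimal sketch that turns a layout result into a spaCy Doc with labelled spans. Everything specific in it is an assumption: `azure_result_to_doc` is a hypothetical helper, the input is a simplified dict only loosely modelled on the service's analyze result (full `content` plus `paragraphs` with a `role` and character `spans`), and the `"layout"` spans key is picked for illustration. Check the field names against the real output, and check what prodigy-pdf actually reads from the Doc, before relying on any of it.

```python
import spacy


def azure_result_to_doc(nlp, result, spans_key="layout"):
    """Hypothetical helper: build a Doc from a simplified layout result.

    `result` is assumed to be a dict with the full text under "content" and a
    list of "paragraphs", each with an optional "role" and a list of
    {"offset", "length"} character spans into that text.
    """
    doc = nlp.make_doc(result["content"])
    spans = []
    for para in result.get("paragraphs", []):
        # Paragraphs without a role get a generic label (an arbitrary choice).
        label = para.get("role") or "paragraph"
        for ref in para["spans"]:
            start = ref["offset"]
            end = start + ref["length"]
            # "expand" snaps character offsets outward to token boundaries.
            span = doc.char_span(start, end, label=label, alignment_mode="expand")
            if span is not None:
                spans.append(span)
    doc.spans[spans_key] = spans
    return doc


nlp = spacy.blank("en")
# Made-up example input, not real service output.
result = {
    "content": "Invoice 2025\nTotal due: 40 EUR",
    "paragraphs": [
        {"role": "title", "spans": [{"offset": 0, "length": 12}]},
        {"role": None, "spans": [{"offset": 13, "length": 17}]},
    ],
}
doc = azure_result_to_doc(nlp, result)
for span in doc.spans["layout"]:
    print(span.label_, repr(span.text))
```

A real wrapper would also need to carry over page numbers and bounding boxes (for example via custom extension attributes), which this sketch leaves out.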

If you want to give it a try, this could also be a potentially great fit for vibe coding: if you give it the Azure AI output, I think a model should be pretty good at rewriting the relevant function in spacy-layout that transforms it into a Doc object. If you try it, definitely let us know how you go!

pablospe · May 19, 2025, 10:25pm

Thank you very much for your response! Yes, I would like to try. Do you think it would be better to create a separate wrapper for spacy-layout, or should this functionality be integrated directly into it? For example, there could be an option to choose between Docling and Azure AI Document Intelligence: either an argument to spaCyLayout, or separate spaCyLayoutDocling and spaCyLayoutAzure classes, with the user choosing which implementation to use. I'm not sure which is better.

I’m not very familiar with the codebase, but since spacy-layout is already installed via pip install spacy-layout and is accessible to everyone, I believe adapting the wrapper to handle different backends could be beneficial.
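To make the "one wrapper, selectable backend" option concrete, here is a small sketch of what that could look like. All names (`LayoutConverter`, `LayoutBackend`, `DoclingBackend`, `AzureBackend`, `Block`) are hypothetical and are not part of spacy-layout, and the backends return placeholder data instead of calling Docling or Azure.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Block:
    """One extracted layout block (hypothetical, simplified)."""
    text: str
    label: str


class LayoutBackend(Protocol):
    """Anything with a parse() method returning blocks can act as a backend."""
    def parse(self, source: str) -> list[Block]: ...


class DoclingBackend:
    def parse(self, source: str) -> list[Block]:
        # Placeholder: a real backend would run Docling here.
        return [Block(text=f"docling:{source}", label="text")]


class AzureBackend:
    def parse(self, source: str) -> list[Block]:
        # Placeholder: a real backend would call the Azure service here.
        return [Block(text=f"azure:{source}", label="text")]


BACKENDS = {"docling": DoclingBackend, "azure": AzureBackend}


class LayoutConverter:
    """Single entry point; the backend is chosen by name or passed in."""

    def __init__(self, backend="docling"):
        if isinstance(backend, str):
            if backend not in BACKENDS:
                raise ValueError(f"Unknown backend: {backend!r}")
            backend = BACKENDS[backend]()
        self.backend = backend

    def __call__(self, source: str) -> list[Block]:
        return self.backend.parse(source)


converter = LayoutConverter(backend="azure")
blocks = converter("report.pdf")
```

With this shape, the separate-classes variant stays available too, since a custom backend instance can be passed in directly instead of a name.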

Best regards,
Pablo

pablospe · May 21, 2025, 6:53am

Proposed PR: "Add Azure AI Document Intelligence backend support" (Pull Request #42, explosion/spacy-layout on GitHub)
