I was wondering if anyone has tried alternatives to Docling, such as Azure AI Document Intelligence, which provides similar output.
Hi and welcome!
In theory, it should be no problem to integrate Azure AI and similar services and use them with prodigy-pdf. Maybe someone from the community has even done it already. We definitely have quite a few Azure + Prodigy users, especially in the privacy-sensitive medical/health field. (To others reading, comment if you have any pointers!)
One approach could be to implement a wrapper similar to spacy-layout: basically, taking a PDF as input and returning a spaCy Doc object. So it would take the Azure output and add the extracted text spans to the Doc.spans (see here for the relevant part of the code). Then it should pretty much work as a drop-in replacement in prodigy-pdf (the code is all open source).
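To make that a bit more concrete, here is a minimal sketch of what the conversion step could look like. It is not how spacy-layout does it internally, just one way to approach it. It assumes the Azure result is available as a plain dict with a "paragraphs" list whose items carry "content", an optional "role" and optional "boundingRegions" with a "pageNumber"; please check that against the actual response you get. The names LayoutSpan, paragraphs_to_text_and_spans and to_doc are made up for illustration, and using "layout" as the spans key is an assumption about what prodigy-pdf expects.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

# Assumption: paragraphs are joined with a blank line to form the Doc text.
SEPARATOR = "\n\n"


@dataclass
class LayoutSpan:
    """Hypothetical container for one extracted layout element."""
    start: int               # character offset into the joined text
    end: int                 # exclusive end offset
    label: str               # e.g. "title", or "paragraph" if no role is given
    page_no: Optional[int]   # page number if the service reported one


def paragraphs_to_text_and_spans(
    result: Dict[str, Any]
) -> Tuple[str, List[LayoutSpan]]:
    """Join paragraph contents and record each one's character offsets."""
    parts: List[str] = []
    spans: List[LayoutSpan] = []
    offset = 0
    for para in result.get("paragraphs", []):
        content = (para.get("content") or "").strip()
        if not content:
            continue  # skip empty paragraphs
        if parts:
            offset += len(SEPARATOR)  # account for the separator before this part
        start, end = offset, offset + len(content)
        regions = para.get("boundingRegions") or [{}]
        spans.append(
            LayoutSpan(
                start=start,
                end=end,
                label=para.get("role") or "paragraph",
                page_no=regions[0].get("pageNumber"),
            )
        )
        parts.append(content)
        offset = end
    return SEPARATOR.join(parts), spans


def to_doc(nlp, text: str, spans: List[LayoutSpan], spans_key: str = "layout"):
    """Build a spaCy Doc and store the layout spans under doc.spans[spans_key]."""
    doc = nlp.make_doc(text)
    layout_spans = []
    for s in spans:
        # "contract" snaps offsets inward to token boundaries if they don't align
        span = doc.char_span(s.start, s.end, label=s.label, alignment_mode="contract")
        if span is not None:
            layout_spans.append(span)
    doc.spans[spans_key] = layout_spans
    return doc
```

With spaCy installed, you would call it roughly like `text, spans = paragraphs_to_text_and_spans(result)` followed by `doc = to_doc(spacy.blank("en"), text, spans)`. Bounding boxes and page layout info would still need to be attached to the spans as custom attributes for the PDF view to work, which this sketch leaves out.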
If you want to give it a try, this could also be a potentially great fit for vibe coding: if you give it the Azure AI output, I think a model should be pretty good at rewriting the relevant function in spacy-layout that transforms it into a Doc object. If you try it, definitely let us know how you go!
Thank you very much for your response! Yes, I would like to try. Do you think it would be better to create a separate wrapper alongside spacy-layout, or should this functionality be integrated directly into it? For example, there could be an option to choose between Docling and Azure AI Document Intelligence. That could be an argument to spaCyLayout, or separate spaCyLayoutDocling and spaCyLayoutAzure classes to choose between; I'm not sure which is better.
I’m not very familiar with the codebase, but since spacy-layout is already available to everyone via pip install spacy-layout, I believe adapting the wrapper to handle different backends could be beneficial.
Best regards,
Pablo