It has been a long time since I last used Prodigy. Now I need to start a new project, and I would like to understand whether Prodigy can be used for it.
I need to analyze semi-structured documents (resumes/CVs, papers, etc.), and I ran into problems during text extraction. I know Prodigy and spaCy do not do text extraction, but I will get to that in a moment.
For text extraction I tried many tools, such as Apache Tika and PDFMiner (for PDF documents only, of course).
The problem is always the same: after extraction, the blocks of text are often jumbled and come out in a strange order.
Take a look at the following screenshot:
The result of extracting text from a document like that is:
Oregon Arts Commission Individual Artist..... etc
With text like that, it is pretty much impossible to do NER or other NLP tasks.
So before talking about tokenization, sentence segmentation, etc., I need to find a way to do Document Layout Analysis so that the text is extracted correctly.
Document layout analysis is a very important task because text often comes from extraction, so I suppose this topic could also help other people.
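To make the ordering problem concrete, here is a minimal sketch of the kind of layout-aware ordering I mean. It assumes the extraction tool can give me each text block together with its bounding box; the `Block` class, the `reading_order` function, and the coordinates are hypothetical and only for illustration. It also assumes a top-left origin, as in image coordinates (PDFMiner reports y from the bottom of the page, so the vertical sort would need to be flipped there).

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical container for one extracted text block and its bounding box.
    x0: float      # left edge
    top: float     # upper edge, measured from the top of the page
    x1: float      # right edge
    bottom: float  # lower edge, measured from the top of the page
    text: str


def reading_order(blocks):
    """Group blocks into columns by horizontal overlap, then read each
    column top to bottom, with columns taken left to right."""
    columns = []  # each column is a list of blocks
    for block in sorted(blocks, key=lambda b: b.x0):
        for column in columns:
            right_edge = max(b.x1 for b in column)
            if block.x0 < right_edge:  # block starts inside this column
                column.append(block)
                break
        else:
            columns.append([block])  # no overlap: start a new column

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.top))
    return ordered


# Two-column page whose blocks arrive in a scrambled order.
blocks = [
    Block(300, 200, 500, 250, "Right 2"),
    Block(50, 100, 250, 150, "Left 1"),
    Block(300, 100, 500, 150, "Right 1"),
    Block(50, 200, 250, 250, "Left 2"),
]
texts = [b.text for b in reading_order(blocks)]
print(texts)
```

This heuristic is too naive for real documents: a full-width header, for example, would overlap every column and merge them into one. That is why I am wondering whether a trained model would do better.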
Do you think Prodigy would be a good fit for recognizing which pieces of text belong to the same block? Maybe by converting the documents into images and then using Prodigy to train a model by tagging the portions of text?
Could you please give me an idea of what approach I could use?
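For the image idea, this is roughly how I imagine preparing the input. It is only a sketch under assumptions: I assume the pages have already been rendered to PNG images by some other tool, and that Prodigy's image annotation (for example the `image.manual` recipe) accepts JSONL tasks whose "image" key holds a base64 data URI, which I would still need to verify against the docs. The helper names and the file name are hypothetical.

```python
import base64
import json


def make_image_task(image_bytes, source_name, mime="image/png"):
    """Wrap raw image bytes in a task dict with a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "image": f"data:{mime};base64,{encoded}",
        "meta": {"source": source_name},  # keeps track of the original page
    }


def to_jsonl(tasks):
    """Serialize tasks as JSONL, one JSON object per line."""
    return "\n".join(json.dumps(task) for task in tasks)


# Placeholder bytes standing in for a real rendered page image.
fake_page = b"\x89PNG placeholder"
task = make_image_task(fake_page, "resume_page_1.png")
jsonl = to_jsonl([task])
print(jsonl)
```

Each page would then be annotated with boxes around the text blocks, using labels such as HEADER or PARAGRAPH (again, just examples).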
Thank you very much!