Hello everybody,
I am going to train a new model for image segmentation. My aim is to find the various blocks that make up my documents. Basically, I convert the documents (PDF/DOC) into JPGs, and the model should then give me the bounding boxes.
My problem is that I need to define a bigger box with smaller ones inside. Please take a look at the attachment.
The red boxes define how I would segment the document, and the inner blue boxes mark specific pieces of information within each segment.
I came up with this approach because when I run tools like PDFMiner or Apache Tika, the order of the extracted text is awful. So I would like to convert the document into an image, find the bounding boxes, and then run OCR on each specific area.
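To make the idea concrete, here is a rough sketch of the pipeline I have in mind (the box coordinates are just placeholders standing in for what the trained model would predict; it assumes pdf2image, Pillow, and pytesseract are installed):

```python
from pdf2image import convert_from_path
import pytesseract

# Render each PDF page to a PIL image (pdf2image wraps poppler).
pages = convert_from_path("document.pdf", dpi=300)

# Placeholder boxes standing in for what a segmentation model would
# predict, as (left, top, right, bottom) pixel coordinates.
predicted_boxes = [(50, 80, 1200, 400), (50, 420, 1200, 900)]

for page in pages:
    for box in predicted_boxes:
        region = page.crop(box)  # cut out one segment
        text = pytesseract.image_to_string(region, lang="ita")
        print(text)
```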
Can I use Prodigy to train a model that finds bounding boxes on textual images?
Yes, you can, but you will have to build and integrate the model yourself. There's a tutorial on how to use TensorFlow's object detection with Prodigy linked at the end of this page: https://prodi.gy/docs/computer-vision
@AK_Fischer
Pardon for asking a stupid question, but can a model like that recognize/understand the text, or just the layout/template of the document? When should we use those kinds of models?
@damiano, that's not a stupid question at all. In fact, it's my response that was likely much too cursory.
What is your end goal? Simply extracting the texts, or are you interested in the bounding boxes for another reason?
Computer vision tasks, especially those that combine optical character recognition and layout analysis, are typically very complex. You can expect a model trained from scratch to need a lot of deeply annotated data. Unless you have a background in computer vision, I think it would make more sense for you to work at the level of OCR'ed texts. You can use Prodigy's image annotation interface and work with extracted portions of texts under the hood.
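For example, collecting box annotations with Prodigy's built-in image.manual recipe could look something like this (the dataset name and labels here are just placeholders):

```
prodigy image.manual doc_layout ./page_images --label SEGMENT,FIELD
```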
@AK_Fischer thank you so much for your reply.
My end goal is to detect the order of the boxes inside the document. There are very good tools like PDFMiner.six and Apache Tika, but I often have to fight with the order of the extracted text. It is a very big problem for me, because working with mixed-up blocks of text will also ruin downstream tasks like NER and text classification; I will surely get poor results.
To give you an example, if I try to extract the text shown in the image above with PDFMiner, I see:
```
2018 - attuale
PONTREMOLI
2020 - attuale
REMOTO
```

and then the block

```
Bartender e bar manager
presso Caffè Bellotti dal 1883
Ideatore e realizzatore del progetto
....
```
Or, if I run another test with a different PDFMiner configuration, it returns:
```
2018 - attuale Bartender e bar manager
PONTREMOLI presso Caffè Bellotti dal 1883
```
That is wrong too.
With that layout, the correct segmentation is the one shown in the image with the red and blue boxes.
Thanks for that information, that makes a lot of sense. So basically, you are looking to "decode" the boxes in the right order so you will get the text in the right order.
I think you have three viable options here.
1. You could look into Tesseract. Tesseract can produce hOCR output, and with medium effort you might be able to write a reading-order heuristic on top of it that works well enough for your use case. Do note that Tesseract tends to struggle with lists and tables, so you will need to build something on top of it (see the sketch after this list).
2. You could look into Azure OCR. It can already extract the text for you -- look for their "natural reading order output" option. It is a paid service, but it is not too expensive and should work very well out of the box for your use case. This would likely be the solution with the least effort and the best effort/performance ratio.
3. You could train your own model. For this, you could use Prodigy to help you collect data. If you want to go down this path, I would strongly advise separating the OCR/text-extraction step from the ordering step. You could annotate boxes to indicate the right order, extract the texts contained in the boxes, and then build e.g. a natural language model to predict the correct order based on the texts. I expect this to be the solution requiring the most effort, and I also expect it to perform worse than Azure's text extraction. If your domain is very, very narrow and you can annotate in excess of 5-10k (better: 30k+) examples, you might be able to build a fairly good model, but it will take a lot of effort.
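To give an idea of the kind of heuristic I mean for the Tesseract route (option 1), here is a minimal sketch that pulls word boxes with pytesseract and sorts them top-to-bottom, left-to-right. This naive ordering is exactly what breaks on multi-column layouts like yours; that is the part you would have to build yourself:

```python
import pytesseract
from PIL import Image

image = Image.open("page.jpg")

# image_to_data returns one entry per detected word, with box coordinates.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

words = [
    (data["top"][i], data["left"][i], data["text"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip()
]

# Naive reading order: sort by vertical position, then horizontal.
# Multi-column layouts will interleave the columns here, which is
# where a smarter heuristic (e.g. column grouping) would come in.
words.sort(key=lambda w: (w[0], w[1]))
print(" ".join(w[2] for w in words))
```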
The feature you are looking for is "natural reading order output", and it is available in the Read API v3.2 preview. It's a brand-new feature, so the non-preview API doesn't have it yet. From what I've seen, it performs very well out of the box, so it might be right for you.
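For reference, a minimal sketch of calling it with requests (endpoint and key are placeholders, and while the feature is in preview the version segment of the URL may be something like v3.2-preview rather than v3.2). The Read API is asynchronous, so you submit the image and then poll the Operation-Location URL:

```python
import time
import requests

# Placeholders -- substitute your own Azure resource endpoint and key.
endpoint = "https://<your-resource>.cognitiveservices.azure.com"
key = "<your-subscription-key>"

with open("page.jpg", "rb") as f:
    resp = requests.post(
        f"{endpoint}/vision/v3.2/read/analyze",
        params={"readingOrder": "natural"},
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        data=f.read(),
    )
resp.raise_for_status()
operation_url = resp.headers["Operation-Location"]

# Poll until the asynchronous analysis finishes.
while True:
    result = requests.get(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    ).json()
    if result["status"] in ("succeeded", "failed"):
        break
    time.sleep(1)

# Lines come back already sorted in natural reading order.
for page in result["analyzeResult"]["readResults"]:
    for line in page["lines"]:
        print(line["text"])
```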
@AK_Fischer it looks very promising, I just wonder one thing: reading the output, I do not see any number referring to the order, so I suppose the lines are already returned in order, ok.
Now I am going to test it more deeply.
However, I would like to understand the "concepts" behind the task a little bit.
I mean... ok, it is an OCR. An OCR can be trained with CNN/LSTM architectures, for example; those are powerful at detecting characters and sentences. But what about the second task, ordering the boxes? Do you think they use a model, or a heuristic/rule-based approach (like distances between words, font size, font style, etc.)?
I know it is a hard question, but I like it and I would like to understand it more.
Thank you
@AK_Fischer one last thing I forgot to ask. Passing images to the /read endpoint, how do you think we should handle multiple pages? Call the API for each page, or concatenate the pages into one big image? The per-page approach would look roughly like the sketch below.
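To make the first option concrete, this is roughly what I mean by calling the API once per page (a sketch; analyze_page is just a hypothetical helper standing in for the Read API call from the earlier example):

```python
import io
from pdf2image import convert_from_path

def analyze_page(image_bytes):
    """Hypothetical helper: submit one page image to the /read endpoint
    and return its lines, as in the earlier polling example."""
    raise NotImplementedError

# Render pages separately and call the API once per page, keeping the
# page index so results can be concatenated in document order.
pages = convert_from_path("document.pdf", dpi=300)

all_lines = []
for number, page in enumerate(pages, start=1):
    buffer = io.BytesIO()
    page.save(buffer, format="JPEG")
    for line in analyze_page(buffer.getvalue()):
        all_lines.append((number, line))
```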