Document layout analysis


It has been a long time since I used Prodigy. Now I need to start a new project and I would like to understand whether Prodigy can be used for it.

I need to analyze semi-structured documents (like resumes/CVs, papers, etc.) and I ran into problems during text extraction. I know Prodigy/spaCy do not do text extraction, but I will get to that in a moment.

For text extraction I tried many tools, like Apache Tika, PDFMiner (for PDF documents only, of course), etc.
The problem is always the same: after extraction, the blocks of text are often jumbled into a strange order.

Take a look at the following screenshot:

The result of extracting text from a document like that is:

Oregon Arts Commission Individual Artist..... etc

With that text it is pretty much impossible to do NER or other NLP tasks.
So before talking about tokenization, sentence segmentation, etc., I need to find a way to do document layout analysis in order to extract the text correctly.

This document analysis is a very important task because the text often comes from extraction, so I suppose this topic can help other people too.

Do you think Prodigy could be a good fit for recognizing which text belongs to the same block? Maybe by converting the documents into images and then using Prodigy to train a model by tagging the portions of text?

Could you please give me an idea of which approach I can use?

Thank you very much!

I'm working through a similar use case and am curious what others are doing to handle this. Right now I'm using AWS Textract for OCR, then doing something similar to this SO answer to construct regions in the document, and then pulling the text for each bounding box from the Textract output.

It's not terribly robust, but it works okay for my use case of extracting the structure from a page of text. I'm similarly looking at annotating images in Prodigy for document layout analysis, and trying to find a good example that makes me confident that if I annotate a bunch of images I'll be able to build a decent model to segment the text into regions.
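For anyone curious, the region-construction step can be sketched roughly like this. This is a simplified stand-in, not Textract's actual output format: I'm assuming word boxes arrive as hypothetical `(x0, y0, x1, y1, text)` tuples (Textract really returns JSON with normalized coordinates), and the grouping heuristic is just "cluster by vertical center, then sort left to right".

```python
# Rough sketch: group word bounding boxes into lines by vertical proximity,
# then sort each line left-to-right. Boxes are (x0, y0, x1, y1, text)
# tuples with the origin at the top-left; real OCR output (e.g. Textract
# JSON) would need to be converted into this shape first.

def group_into_lines(boxes, y_tolerance=5):
    """Cluster word boxes whose vertical centers are within y_tolerance."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        center = (box[1] + box[3]) / 2
        for line in lines:
            if abs(line["center"] - center) <= y_tolerance:
                line["boxes"].append(box)
                break
        else:
            lines.append({"center": center, "boxes": [box]})
    # Within each line, read words left to right by their x0 coordinate
    return [
        " ".join(b[4] for b in sorted(line["boxes"], key=lambda b: b[0]))
        for line in lines
    ]

words = [
    (0, 0, 40, 10, "Oregon"),
    (45, 1, 80, 11, "Arts"),
    (0, 20, 60, 30, "Commission"),
]
print(group_into_lines(words))  # ['Oregon Arts', 'Commission']
```

The SO answer linked above does something smarter with dilation over the image itself, but the idea is the same: recover reading order from geometry instead of trusting the extractor's stream order.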

Hi @matthewvielkind,
happy to see others are fighting with this task :smiley: We are not lonely soldiers.

That said, getting serious again.
I have never used Textract; I can give it a try, but I would like to understand what approach to use for this problem. My problem is not "simply" identifying the class (text, image, title); I also need to understand the relations between blocks, so it is basically a task with two subproblems:

  1. Understand the content (text, images, etc.)
  2. Understand whether that text/image is related to other (nearby) blocks.

I hope @ines or @honnibal could give us advice.

Unfortunately I haven't worked on this problem myself, so I don't have very detailed advice. One thing you could try is a computer vision-based approach, treating the PDF as an image and using an object detection algorithm to identify bounding boxes and relations.

Alternatively, you could extract the information into some sort of graph, and use a graph neural network to do the classification.

In my experience, the new Azure Read API works wonders and can even handle fairly complex layouts. It's very new and you have to apply for access, but you might want to consider it.

@honnibal thank you for your message!

I found this tool, Bulk PDF to Text Extractor.
It basically converts the PDF into TXT while keeping the exact word positions of the original document.
So extraction is OK, but the text is still not usable in spaCy. For example, this is a piece of the extraction:

                                    Web Developer - 09/2015 to 05/2019
Address:                            Web Design, New York
xxx Great Portland Street, London      • Cooperate with designers to create clean interfaces and
XXX 123                                   simple, intuitive interactions and experiences.
                                       • Develop project concepts and maintain optimal
Phone:                                    workflow.
+44 (0)20 xxxx xxxxx
                                       • Work with senior developer to manage large, complex
                                          design projects for corporate clients.

It is perfect, the alignment is OK too, and I avoid the hassle of converting text from images like with an OCR.

But the problem is that spaCy cannot detect sentences. Is it possible somehow? If I split the text with .split('\n'), I get mixed column content. That works for a top-down flow, but in my case there can be columns.
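One possible way to un-mix the columns before handing the text to spaCy is a sketch like the following: find the widest vertical run of spaces shared by every line (the gutter) and split there, then read each column separately. This assumes a fixed two-column layout like the resume excerpt above; the function name and `min_gap` threshold are my own, not from any library.

```python
# Sketch: split fixed-width two-column text at the space "gutter" that
# runs through every line, so each column can be segmented separately.
# Assumes exactly one reliable gutter; multi-column or ragged layouts
# would need a more robust approach.

def split_columns(text, min_gap=3):
    lines = text.splitlines()
    width = max(len(l) for l in lines)
    padded = [l.ljust(width) for l in lines]
    # A column index is "blank" if every line has a space there
    blank = [all(line[i] == " " for line in padded) for i in range(width)]
    # Find the widest run of blank columns: that's our gutter
    best_start, best_len, run_start = None, 0, None
    for i, b in enumerate(blank + [False]):
        if b and run_start is None:
            run_start = i
        elif not b and run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    if best_len < min_gap:
        return [text]  # no reliable gutter found, leave text as-is
    cut = best_start + best_len
    left = "\n".join(l[:best_start].rstrip() for l in padded)
    right = "\n".join(l[cut:].rstrip() for l in padded)
    return [left, right]

sample = (
    "Address:            Web Developer\n"
    "London, XXX 123     New York"
)
left, right = split_columns(sample)
print(left)   # Address: / London, XXX 123
print(right)  # Web Developer / New York
```

Once the columns are separated, each one becomes a normal top-down text stream that spaCy's sentence segmentation has a fighting chance with.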

Do you think following this approach could be useful, or is the sentence segmentation too difficult (or rather, not applicable)?

@AK_Fischer Thanks, I will try their service too.