Information Extraction for long, semi-unique documents

Hello good people,

First of all, thank you for creating this fantastic tool!

I am stuck on a particular information extraction task and hoped I could get some of your input.

I have a few hundred long (5-30 page) PDF, DOC and DOCX project documents from which I want to extract specific information and store it in a structured database.

NER usually seems like a good fit for this; however, the documents all have very distinct structures, which can also change from document to document. For instance, a lot of relevant information is stored in tables as numbers, single keywords or longer text, where the column names are needed to make sense of the values.

As far as I understand, spaCy's NER uses the surrounding words to identify entities, so loading in the whole documents as raw text and manually tagging column keywords and phrases will probably not produce good training data.

It seems I have to do some sort of pre-processing, but I am unsure how to automate it across the different document types and structures to make the data useful for Prodigy.

An example of the document structure is as follows:


Table (0.5 pages)
Table (1 page)
Text (0.5 pages)
Table (1 page)
Text (3 pages)
... and so on.

As a last resort, I thought about just extracting the whole raw text from the documents (only removing linebreaks in Python) and manually tagging the information I want with ner.manual. However, I am not sure whether I can “abuse” ner.manual to annotate whole paragraphs just for information extraction, especially while also using it to annotate real entities.
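For reference, the linebreak clean-up I had in mind is only a small whitespace normalisation in Python, roughly along these lines (just a sketch of my idea, not anything final):

```python
import re

def normalize(raw: str) -> str:
    # Collapse hard line breaks *inside* paragraphs into spaces,
    # but keep blank lines as paragraph boundaries.
    paragraphs = re.split(r"\n\s*\n", raw)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)

# A line break mid-sentence is joined; the blank line between
# paragraphs is preserved.
print(normalize("The project was\nfunded in 2019.\n\nNext paragraph."))
```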

Also, is there a way to link the extracted information/entities back to the sentence/paragraph they were extracted from? I ask because we will often extract the same entity a few hundred times (for instance, an organisation like “World Bank”), and to make use of this information when querying it back later, I need the surrounding context.

Many questions. My apologies. I hope you can shed some light on these issues.



Thanks for the kind words, I hope we can help you complete your project :). I've answered some of your questions in your other thread; I think it'll probably be easier to keep discussion there, to keep it in one place.

Table extraction is a common problem, and it's one that no current NLP tooling I'm aware of handles very well. You might have to get a bit creative. I will say, though, that you shouldn't need to use the NER tooling for whole paragraphs: it should be both quicker and easier to make that a textcat task.

As I mentioned in the other thread, it will be essential to divide up the documents into tasks, while maintaining a mapping back to the original document and paragraph IDs, so you can match things up later.
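As a minimal sketch of what that mapping can look like (all names here are hypothetical, not part of any API): each paragraph becomes a task that carries its document and paragraph ID in "meta", and an annotation coming back can be matched to its surrounding context through those IDs:

```python
# Hypothetical sketch: split a document into paragraph-level tasks that
# point back to the source, then recover the context of an annotation.
doc = {
    "doc_id": "report_001",
    "paragraphs": [
        "Para about the World Bank.",
        "Another paragraph.",
    ],
}

tasks = [
    {"text": p, "meta": {"doc_id": doc["doc_id"], "para_id": i}}
    for i, p in enumerate(doc["paragraphs"])
]

# An annotation that comes back keeps its meta, so the original
# paragraph (and its neighbours) can be looked up again.
annotation = {
    "text": "Para about the World Bank.",
    "spans": [{"start": 15, "end": 25, "label": "ORG"}],
    "meta": {"doc_id": "report_001", "para_id": 0},
}

context = doc["paragraphs"][annotation["meta"]["para_id"]]
print(context)  # the paragraph the span was extracted from
```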