Hello good people,
First of all, thank you for creating this fantastic tool!
I am stuck on a particular information extraction task and hoped I could get some of your valuable input.
I have a few hundred long (5-30 page) PDF, DOC and DOCX project documents from which I want to extract specific information and store it in a structured database.
NER usually seems like a good solution for this; however, the documents all have very unique structures, which can also change from document to document. For instance, a lot of relevant information is stored in tables as numbers, single keywords or longer text, where the column names are needed to make sense of the values.
As far as I understand, spaCy's NER uses the surrounding words to identify entities, so loading whole documents as raw text and manually tagging column keywords and phrases will probably not produce good training data.
It seems I have to do some sort of pre-processing, but I am unsure how to automate pre-processing of different document types and structures so that the output is useful for Prodigy.
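One idea I had for the table parts (just a sketch with a made-up table; in practice the header and rows would come from whatever PDF/DOCX table extractor I end up using) is to flatten each table row into short "column: value" sentences, so the column names become local context that an NER model could actually use:

```python
# Sketch: turn an extracted table (header row + data rows) into short
# "Column: value." sentences. The table below is a placeholder; real
# input would come from a PDF/DOCX table extraction step.

def flatten_table(header, rows):
    """Yield one 'Col: val. Col: val.' sentence per table row."""
    sentences = []
    for row in rows:
        parts = [f"{col}: {val}" for col, val in zip(header, row) if val]
        sentences.append(". ".join(parts) + ".")
    return sentences

header = ["Project", "Funding Organisation", "Budget"]
rows = [
    ["Water Access Phase II", "World Bank", "1.2M USD"],
    ["Rural Roads", "KfW", "800k EUR"],
]

for sent in flatten_table(header, rows):
    print(sent)
# → Project: Water Access Phase II. Funding Organisation: World Bank. Budget: 1.2M USD.
# → Rural Roads: ... (one sentence per row)
```

That way every cell value would be annotated next to its column name, instead of floating in a context-free blob of table text. No idea if this is sensible, though.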
An example of the document structure is as follows:
Table (0.5 page)
Table (1 page)
Text (0.5 page)
Table (1 page)
Text (3 pages) … and so on.
As a last resort, I thought about just extracting the whole raw text from the documents (only removing linebreaks in Python) and manually tagging the information I want with ner.manual. However, I am not sure whether I can “abuse” ner.manual to annotate whole paragraphs just for information extraction, especially while also using it to annotate real entities.
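Concretely, this last-resort approach would look something like the sketch below (the example text is made up, and I am assuming paragraphs are separated by blank lines): normalise the raw text and write one Prodigy-style JSONL task per paragraph.

```python
import json

def text_to_tasks(raw_text):
    """Split raw document text into paragraph-sized tasks for Prodigy,
    collapsing linebreaks and extra whitespace inside each paragraph."""
    tasks = []
    # Assumption: paragraphs are separated by blank lines.
    for i, para in enumerate(raw_text.split("\n\n")):
        clean = " ".join(para.split())  # remove linebreaks / extra spaces
        if clean:
            tasks.append({"text": clean, "meta": {"paragraph": i}})
    return tasks

raw = "The project is funded\nby the World Bank.\n\nBudget: 1.2M USD."
for task in text_to_tasks(raw):
    print(json.dumps(task))
```

Each line of output is one annotation task, and the `meta` field lets me trace an annotation back to its paragraph later.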
Also, is there a way to link the extracted information/entities back to the sentence/paragraph they were extracted from? I am asking because we will often extract the same entity a few hundred times (for instance, an organisation like “World Bank”), and to make use of this information when querying it later, I need the surrounding context.
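My current workaround idea for this (a sketch with illustrative table/field names, using SQLite just as an example store) is to save every extracted entity together with a document id and its surrounding sentence, so repeated entities like “World Bank” can be queried back with their contexts:

```python
import sqlite3

# Sketch: keep each extracted entity alongside the sentence it came
# from. Table and field names here are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities (doc_id TEXT, label TEXT, text TEXT, context TEXT)"
)

extractions = [
    ("doc_017", "ORG", "World Bank", "The project is funded by the World Bank."),
    ("doc_112", "ORG", "World Bank", "A World Bank grant covers phase two."),
]
conn.executemany("INSERT INTO entities VALUES (?, ?, ?, ?)", extractions)

# Query every context in which "World Bank" was extracted.
rows = conn.execute(
    "SELECT doc_id, context FROM entities WHERE text = ?", ("World Bank",)
).fetchall()
for doc_id, context in rows:
    print(doc_id, context)
```

But I do not know whether Prodigy/spaCy already offer something built in for this, e.g. via the character offsets in the annotations.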
Many questions. My apologies. I hope you can shed some light on these issues.