Hi,
I have never dealt with NER or spaCy before, so sorry if my questions are too basic.
I want to use NER to automate data extraction from business documents like invoices. The task is understandably quite challenging in terms of both annotation and training. The dataset is huge, but it has no real value until a consistent annotation scheme has been decided on.
To clarify my pipeline: text is extracted from PDFs using Tesseract, so, as you can imagine, there is a lot of room for OCR errors, typos, and random characters. Entity extraction is then supposed to be performed on that text to automate the creation of internal documentation.
Below are my questions:
- How important is text cleaning in such cases? I can already see that, depending on the quality of the source PDFs, the texts can be really dirty. The most intuitive approach I can think of is removing unnecessary newlines, braces, special characters, etc.; I have put a rough sketch of what I mean below the questions.
- Documents are not very long, but they are definitely longer than the sentences I usually see in NER tutorials. Should I split the documents into sentences before annotating and training? For some of them it is not obvious how to do the splitting, because they do not have a well-defined sentence-like structure (see the second sketch below).
- How do I approach fields that consist of just a number and can therefore confuse the model? Take a field like Tax Amount: it is essentially just a monetary value that a pretrained model would happily tag as MONEY, but in my case I would like it to be extracted as tax_amount (the third sketch below shows the annotation scheme I have in mind).
One solution I have thought of is annotating the key names as well, Tax Amount in this case, so effectively I would annotate both the phrase itself (Tax Amount) and its numeric value as the tax_amount entity. This doesn't sound like a very good idea though, because the model might then get confused by other numeric values that share the same format.
It would be easier if the field name and its value were always next to each other in the text, but unfortunately this is not guaranteed, so sometimes they end up in different parts of the document.
On the other hand, isn't it wasteful to use NER for extracting the field names themselves when I could just brute-force a text search to find them and then use the bounding-box information to map them to their values?
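For the text-cleaning question, this is roughly the kind of cleanup I had in mind; the regex patterns and the character set are just guesses on my part and would obviously need tuning against real OCR output:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Very rough OCR cleanup (placeholder patterns, not a final list)."""
    # Collapse runs of whitespace and newlines into single spaces
    text = re.sub(r"\s+", " ", text)
    # Drop brackets, braces and other symbols that rarely carry meaning here
    text = re.sub(r"[{}\[\]<>|\\~^]", "", text)
    return text.strip()

print(clean_ocr_text("Tax  Amount :\n\n 1250.00 {scan artefact}"))
# -> "Tax Amount : 1250.00 scan artefact"
```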
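For the splitting question, this is how I imagined it would look with spaCy's rule-based sentencizer, though I suspect splitting on layout blocks or newlines might make more sense for invoice-like text that has no real sentences:

```python
import spacy

# Blank English pipeline with only the rule-based sentence splitter
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

ocr_text = "Payment is due within 30 days. Tax Amount: 1250.00. Thank you for your business."
doc = nlp(ocr_text)
for sent in doc.sents:
    print(sent.text)
```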
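And for the tax_amount question, this is roughly the annotation scheme I pictured, i.e. labelling only the value with a custom label; the snippet text, the character offsets, and the file name are made up just to illustrate:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")

# Made-up invoice snippet; only the value gets the custom label
text = "Tax Amount 1250.00 Total 8750.00"
doc = nlp(text)
tax = doc.char_span(11, 18, label="tax_amount")  # the span "1250.00"
doc.ents = [tax]

# Serialize into the binary format that `spacy train` expects
DocBin(docs=[doc]).to_disk("train.spacy")
```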
Thanks in advance!