Correct way to annotate data in my case (spaCy newbie here)

Hi,

I have never dealt with NER or spaCy before, so sorry if my questions are too stupid.
I want to use NER to automate data extraction from business documents like invoices. The task is quite understandably challenging in terms of both annotation and training. The dataset is huge, but it has no meaning until a consistent annotation scheme has been decided upon.

To clarify my pipeline: text is extracted from PDFs using Tesseract, so, as you can imagine, there is a lot of room for OCR errors, typos, and random characters. Entity extraction is supposed to be performed on that text to automate the creation of internal documentation.

Below are my questions:

  1. How important is text cleaning in such cases? I can already see that depending on the quality of source PDFs, texts might be really dirty. The most intuitive approach I can think of is removing unnecessary newlines, braces, special characters, etc.

  2. Documents are not very long, but definitely longer than sentences that I often encounter in tutorials on NER. Should I split documents into sentences prior to annotating and training? For some of them, it might not be obvious how to split them because the documents do not have a well-defined sentence-like structure.

  3. How do I approach fields that can consist of just a number and thus confuse the model? For example, a field like Tax Amount. It is essentially just a monetary value that can be easily understood as MONEY by pretrained models, but in my case, I would like it to be extracted as tax_amount.
    One solution I have thought of is annotating key names as well: Tax Amount in this case. So effectively I will annotate both the phrase itself (Tax Amount) and its numeric value as the tax_amount entity. This doesn't sound like a very good idea though because the model might then be confused when looking at other numeric values that can have the same format.

It would be easier if the field name and value were next to each other in the text, but unfortunately this is not always the case, so sometimes they end up in different parts of the text.

On the other hand, isn't it wasteful to use NER for extracting field names when I could just use brute-force text search to find them and then use knowledge about bounding boxes for mapping?

Thanks in advance!

Hi,

Don't worry, these are definitely good questions! I hope you'll be able to solve your problem well with Prodigy.

> How important is text cleaning in such cases?

It's hard to answer in the general case. Unclean text will make it harder to make use of pretrained models, and probably increase the number of annotations you need to get to a particular accuracy. One thing to keep in mind is that OCR errors aren't evenly distributed: they affect unknown words much more often, which means names are more likely to have OCR errors than other words. That might be relevant for your situation.
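If you do clean, I'd keep it light-touch: aggressive cleaning can destroy characters that are part of real entity values. As a minimal sketch (the exact set of characters to strip is an illustrative assumption, not a recommendation for your data):

```python
import re

def clean_ocr_text(text: str) -> str:
    """Conservative cleanup of OCR output before annotation."""
    # Collapse runs of newlines and whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    # Drop stray braces and similar junk, but keep punctuation,
    # currency symbols, and digits that entity values need.
    text = re.sub(r"[{}\[\]|\\^~]", "", text)
    return text.strip()

print(clean_ocr_text("Tax  Amount:\n\n  {120.50}"))
# Tax Amount: 120.50
```

It's worth spot-checking the output on a sample of your dirtiest PDFs before committing to any cleaning rule.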

> Documents are not very long, but definitely longer than sentences that I often encounter in tutorials on NER. Should I split documents into sentences prior to annotating and training?

If the texts are a few hundred words, I would keep them together. We've made sure spaCy and Prodigy work on texts longer than a sentence for this sort of use-case. If they get really long (multiple pages), it's awkward to have all the text on the screen for annotation, so you might want to break them up somehow. spaCy will work fine on texts even a few thousand words long (although it might get slow). Once you hit around 10k words per text, you might run into memory problems.
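If you do need to break up the really long ones, something like this keeps short texts whole and splits long ones on sentence boundaries (the 500-word threshold is an arbitrary choice for illustration):

```python
import spacy

# A blank pipeline with just the rule-based sentencizer is
# enough for splitting, and fast.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def maybe_split(text: str, max_words: int = 500) -> list[str]:
    """Keep short documents whole; split long ones into chunks
    of roughly max_words tokens on sentence boundaries."""
    if len(text.split()) <= max_words:
        return [text]
    doc = nlp(text)
    chunks, current, count = [], [], 0
    for sent in doc.sents:
        if current and count + len(sent) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sent.text)
        count += len(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on sentence boundaries (rather than at a fixed character offset) avoids cutting an entity in half, which matters for annotation.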

> How do I approach fields that can consist of just a number and thus confuse the model?

You should probably recognise these as MONEY to start with, and then have a post-process that recognises whether some MONEY is a TAX_AMOUNT. You might find this task is actually really easy, and some rules do fine on your data. But the clues to figure it out might not be convenient for the NER model to combine with the identification of the MONEY entity.
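As a sketch of such a post-process, a plain rule over the extracted spans is often enough. The keyword list and context window below are illustrative assumptions, and the hard-coded offsets stand in for whatever your NER model returns:

```python
def relabel_money(text: str, ents: list[tuple[int, int, str]],
                  window: int = 40) -> list[tuple[int, int, str]]:
    """Relabel MONEY spans as TAX_AMOUNT when a tax keyword
    appears shortly before the span in the raw text."""
    keywords = ("tax", "vat")  # illustrative; extend per domain
    out = []
    for start, end, label in ents:
        if label == "MONEY":
            context = text[max(0, start - window):start].lower()
            if any(k in context for k in keywords):
                label = "TAX_AMOUNT"
        out.append((start, end, label))
    return out

text = "Subtotal: 500.00  Tax Amount: 120.50"
ents = [(10, 16, "MONEY"), (30, 36, "MONEY")]
print(relabel_money(text, ents))
# [(10, 16, 'MONEY'), (30, 36, 'TAX_AMOUNT')]
```

The nice property of this split is that the NER model only has to learn one easy distinction (is it money?), while the domain-specific part stays in rules you can inspect and adjust.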

I would advise against using NER for anything but identifying names in sentential text. If you have something more like a table, I would try a different route, likely with rules.
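For the rule-based route, plain regular expressions keyed on field labels can get you surprisingly far on key-value layouts. The patterns below are illustrative, not a complete schema:

```python
import re

# Illustrative patterns: a field label followed by a monetary
# value. Real invoices will need more robust patterns.
FIELD_PATTERNS = {
    "tax_amount": re.compile(r"Tax\s+Amount\W*([\d.,]+)", re.IGNORECASE),
    "total": re.compile(r"\bTotal\W*([\d.,]+)", re.IGNORECASE),
}

def extract_fields(text: str) -> dict[str, str]:
    """Return the first match for each known field label."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        m = pattern.search(text)
        if m:
            fields[name] = m.group(1)
    return fields

print(extract_fields("Total: 620.50\nTax Amount: 120.50"))
```

Combined with the bounding-box information you mentioned, this kind of search is usually much more reliable than asking a statistical model to learn table layouts.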