Best annotation strategy for NER


Sorry if this is a stupid question, but we are new to AI, and I understand that we need to annotate the right way from the start, so that we don't get wrong results later and have to start over.

Our scenario is that we need to do NER on contracts, to recognise the persons involved, value, dates, and some other entities specific to our business case.

My main question: given that 90% of the documents are 2 or 3 pages long, is it better to annotate each entire document at once, for thousands of documents, or to "break" the documents into smaller statements and annotate those statements? This is important to know because if we can annotate entire documents, we can start now, but if we need to do some work before annotating, our workload will increase.

The examples I have seen use minimal statements, and my use case is a little different.

Another question, related to the size of the text to annotate: is it OK to have, for example, 20 labels in the SAME TEXT (for long texts), or is it better to "break" the text into small statements and annotate at most 2 or 3 labels per statement? Does it make a difference?

Thanks in advance.

Hi! Most NER implementations, including spaCy's default NER model, typically look at a very narrow context window, e.g. a few surrounding words on either side. So there's usually not an advantage in labelling the whole text at once as opposed to sentence by sentence or paragraph by paragraph. In fact, it can sometimes be counterproductive: if you design your annotation scheme so that it needs context that's very far away, you could collect data that your model might not be able to learn from.

Splitting your text up shouldn't be too difficult. You can always use spaCy with a pretrained model or rule-based sentencizer and split the text into sentences. This will also make it much easier to annotate, because you get to move through the data faster and save more often.
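
As a rough illustration of that splitting step, here's a minimal sketch. It assumes spaCy v3 and uses a blank English pipeline with only the rule-based sentencizer, so no pretrained model needs to be downloaded. The helper name `split_into_sentences` and the contract snippet are made up for the example, and real contracts may need a pretrained pipeline or custom rules to split cleanly.

```python
import spacy

# Blank English pipeline with only the rule-based sentencizer
# (assumes spaCy v3; no pretrained model download is needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def split_into_sentences(text):
    """Return (start_char, sentence_text) pairs for each non-empty sentence.

    Keeping the character offset makes it possible to map annotations
    made on a sentence back to the original document later.
    """
    doc = nlp(text)
    return [(sent.start_char, sent.text) for sent in doc.sents if sent.text.strip()]


# Hypothetical contract snippet, for illustration only.
contract = (
    "This agreement is made between Alice Smith and Bob Jones. "
    "The total value is 10,000 USD. "
    "It takes effect on 1 March 2021."
)

sentences = split_into_sentences(contract)
for start, sentence in sentences:
    print(start, sentence)
```

Each sentence can then be sent to the annotation tool as its own example. If sentences turn out to be too short to be meaningful in your contracts, the same idea works for paragraphs, e.g. by splitting on blank lines instead.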
