Best practices for NER annotation

Hi,

I found the answers here quite insightful: Ambiguous NER annotation decisions
However, I have a slightly different question that I couldn't find an answer to. I need to detect legal entities, e.g. article L.xx-xx, or sometimes article L.xx-xx of 10 December 2xxx.
For my application, just L.xx-xx is enough. The additional date of 10 December 2xxx is nice to have, but not necessary as long as L.xx-xx is detected reliably.
The word article is not really necessary either, even though it often precedes the entity and a human would naturally consider it part of the entity. That said, if the word article is detected as part of the entity, it would not harm application performance.
What is the optimal annotation strategy for the model (either spaCy or Transformers) to reliably detect the legal entity in my case: include article and/or the date of 10 December 2xxx, or keep just L.xx-xx (which can have varying formats, not necessarily starting with L)? Note that false positives are preferred to false negatives.
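To make it concrete, here is how the two alternatives would look as character-offset spans over a sentence built from the same placeholders (the surrounding sentence and the LEGAL_REF label are just made up for illustration):

```python
# Illustrative only: placeholder sentence and a hypothetical LEGAL_REF label
text = "Pursuant to article L.xx-xx of 10 December 2xxx, the contract is void."

# Option A: annotate only the code itself
code = "L.xx-xx"
short_span = {"start": text.index(code), "end": text.index(code) + len(code), "label": "LEGAL_REF"}

# Option B: include "article" and the date in the span
full = "article L.xx-xx of 10 December 2xxx"
long_span = {"start": text.index(full), "end": text.index(full) + len(full), "label": "LEGAL_REF"}

print(text[short_span["start"]:short_span["end"]])  # L.xx-xx
print(text[long_span["start"]:long_span["end"]])    # article L.xx-xx of 10 December 2xxx
```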
More broadly, if an entity has standard preceding/following tokens, does the model pay enough attention to this standard surrounding context, or is it better to include that context in the entity?
As a side note, I have sometimes read NER best practices such as keeping an entity under 10 tokens. Does that imply that keeping an entity as short as possible is more beneficial for the model?

More broadly, if an entity has standard preceding/following tokens, does the model pay enough attention to this standard surrounding context, or is it better to include that context in the entity?

I do think the model will be able to condition on that context, so you won't necessarily have to annotate it as part of the entity. It's an empirical question though, so it can be hard to guess what's optimal.

One thing to keep in mind is that you can always adjust the entities with rules afterwards, so it shouldn't be too difficult to make the transformation if you do decide to keep the extra tokens in the spans.
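For example, here's a minimal sketch of what that rule-based adjustment could look like as a spaCy v3 post-processing component. The component name, the LEGAL_REF label and the choice to strip a leading "article" token are all placeholders for whatever rule you settle on:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

@Language.component("strip_article_prefix")
def strip_article_prefix(doc):
    # Hypothetical rule: drop a leading "article" token from predicted
    # LEGAL_REF entities if you annotate it but don't need it downstream.
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "LEGAL_REF" and len(ent) > 1 and ent[0].lower_ == "article":
            new_ents.append(Span(doc, ent.start + 1, ent.end, label=ent.label_))
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

# nlp = spacy.load("my_legal_model")  # hypothetical trained pipeline
# nlp.add_pipe("strip_article_prefix", last=True)
```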

I would say that the most important consideration is whether the phrases are reasonably "atomic", that is, relatively fixed phrases with relatively little internal structure. It also matters a lot whether the phrases have clear beginning and end tokens.

The entity recogniser performs less well if you have phrases that combine freely with other phrases, as in ordinary syntax. The entity recogniser is a sequence tagger, and so it's best at recognising regular languages, rather than context-free languages.


Thank you @honnibal
Just a few more questions.

I'm considering adjusting annotations using regex or basic Python on an annotated .jsonl. Is there a better way to post-process annotations in Prodigy?
Related to the previous question, I've also got annotations from several different people who sometimes used a single span for several comma-separated entities, and sometimes annotated each entity separately. The entities here are just numbers. Would you have any advice on post-processing the annotations in this case?
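For the second point, this is roughly what I had in mind as a sketch over the exported .jsonl. It assumes the usual Prodigy task format with character-offset spans ("text" plus a "spans" list), uses a placeholder NUMBER label, and ignores token_start/token_end, which would still need to be recomputed afterwards:

```python
import json
import re

def split_grouped_spans(task, label="NUMBER"):
    """Split spans covering several comma-separated numbers into one span per number."""
    text = task["text"]
    new_spans = []
    for span in task.get("spans", []):
        if span["label"] != label:
            new_spans.append(span)
            continue
        covered = text[span["start"]:span["end"]]
        # One new span per number found inside the original annotation
        for m in re.finditer(r"\d+(?:[.-]\d+)*", covered):
            new_spans.append({
                "start": span["start"] + m.start(),
                "end": span["start"] + m.end(),
                "label": label,
            })
    task["spans"] = new_spans
    return task

with open("annotations.jsonl", encoding="utf8") as f_in, \
     open("annotations_split.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        f_out.write(json.dumps(split_grouped_spans(json.loads(line))) + "\n")
```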

A couple of adjacent questions. Is there a rule of thumb for the maximum number of tokens in an entity for a spaCy model, if we decide to keep entities grouped in a single span?
Can grouping comma-separated unit entities into a single span improve precision?