If I see text with no entities at all in match modes, should I:
A. Accept, leaving no text highlighted.
B. Reject, leaving no text highlighted.
C. Ignore, leaving no text highlighted.
Plus, what’s the difference between doing A, B or C? Generally, the controls (while great for things like ner.teach) don’t seem intuitive/well-suited to these workflows - unless I’m missing something!
This is a good question and actually a very important one. TL;DR answer: You should pretty much always accept examples with no entities if the text doesn’t actually contain any entities.
Training your model with examples of entities and examples of what’s not an entity / texts without any entities is very important – you don’t want it to overfit and “hallucinate” entities because it’s never seen a single example without entity annotations during training.
Accepting an example will include it in the training data. Ignoring an example will always exclude it from the training and evaluation data, so you should only really do that for examples that are too difficult to answer, weird, broken etc.
How rejected examples are handled depends on how you’re training the model later on. If you’re training it from binary accept/reject examples, the accepted and rejected examples will help construct the best possible analysis given your feedback. You can see some examples of that in my slides here. This means that the model can be updated accordingly, even if you haven’t collected annotations about every single token.
If you’re training from manual annotations and set the --no-missing flag, spaCy will assume that the data is “gold standard”, that the annotated entities are the only entities present in the data, and that all other tokens are O (outside an entity). So only the accepted answers will be used and they’ll be treated as the perfect and “final” analysis. If an accepted text has no entities highlighted, this will be interpreted as “this text has no entities”.
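To make the gold-standard case a bit more concrete, here is a minimal sketch in plain Python of how accepted, rejected and ignored answers could be filtered under that assumption. This is not Prodigy's or spaCy's actual implementation: `to_gold_examples` is a hypothetical helper and the example texts are made up. The only thing taken from Prodigy is the general shape of an annotation task, a dict with `"text"`, `"spans"` and `"answer"` keys.

```python
def to_gold_examples(tasks):
    """Convert Prodigy-style task dicts into (text, annotations) pairs.

    Hypothetical helper, assuming gold-standard manual annotations:
    only accepted answers are kept, and whatever spans they contain
    are treated as the complete set of entities in the text.
    """
    examples = []
    for task in tasks:
        # Rejected and ignored answers are skipped in this setting.
        if task.get("answer") != "accept":
            continue
        spans = task.get("spans", [])
        entities = [(s["start"], s["end"], s["label"]) for s in spans]
        # An empty list means "this text has no entities",
        # which is a useful negative example for the model.
        examples.append((task["text"], {"entities": entities}))
    return examples


# Made-up annotations for illustration only.
tasks = [
    {
        "text": "Apple opened an office in Berlin.",
        "spans": [
            {"start": 0, "end": 5, "label": "ORG"},
            {"start": 26, "end": 32, "label": "GPE"},
        ],
        "answer": "accept",
    },
    # Accepted with nothing highlighted: kept, with zero entities.
    {"text": "Thanks, that works now.", "spans": [], "answer": "accept"},
    # Broken text: ignored, so excluded entirely.
    {"text": "asdf ### 1234", "answer": "ignore"},
    # Rejected: not used when training from gold-standard annotations.
    {"text": "See you in Paris.", "spans": [], "answer": "reject"},
]

examples = to_gold_examples(tasks)
```

In this sketch, the accepted text with nothing highlighted ends up in `examples` with an empty entity list, while the ignored and rejected texts are dropped. That is why accepting, rather than rejecting or ignoring, is the right call for a clean text that really contains no entities.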