How to score incompletely highlighted entities?

If you’re using an active learning-powered recipe like ner.teach, both of those entities are suggestions you should reject. By rejecting incorrect boundaries, you’re essentially telling the model “Nope, try again!” and nudging it towards the correct boundaries. Each token can only be part of one entity, so if you accepted a partial match like “Hong”, the feedback the model would get is: “Yep, in contexts like this, ‘Hong’ is a single-token GPE entity and wins over all other possible analyses containing this token!” That’s obviously not what you want.
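To make the “one token, one entity” constraint concrete, here’s a minimal spaCy sketch (the example sentence and token indices are mine, purely for illustration): overlapping entity spans are rejected outright, so accepting just “Hong” claims that token and rules out the full “Hong Kong” analysis.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I just got back from Hong Kong.")

# The correct analysis: "Hong Kong" as one two-token GPE entity
doc.ents = [Span(doc, 5, 7, label="GPE")]

# A token can only belong to one entity, so the partial match "Hong"
# and the full match "Hong Kong" can never coexist in one analysis:
try:
    doc.ents = [Span(doc, 5, 6, label="GPE"), Span(doc, 5, 7, label="GPE")]
except ValueError as err:
    print(err)  # spaCy raises an error for overlapping entity spans
```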

If you’re labeling manually (e.g. using ner.manual or ner.make-gold), your focus is slightly different: the dataset you produce and later train from should reflect the gold-standard analysis, with all required labels and no missing or unknown values. It’s totally fine to do this in several steps, by the way – in fact, we usually recommend focusing on a smaller label set when you label manually and making several passes over the data if necessary, as in the sketch below.
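For example, a multi-pass manual workflow could look like this (the dataset names, base model and source file are placeholders I made up):

```bash
# First pass: only label people and organisations
prodigy ner.manual news_people en_core_web_sm ./news.jsonl --label PERSON,ORG
# Second pass over the same text: only label places
prodigy ner.manual news_places en_core_web_sm ./news.jsonl --label GPE,LOC
```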

Prodigy can train from both types of annotations: binary accept/reject feedback on single entities where the entity labels for the rest of the text are unknown, and gold-standard annotations that describe the complete text and all entities in it (or the fact that the text contains no entities). The --no_missing flag on ner.batch-train lets you tell Prodigy that no entities are missing and that your data should be treated as gold standard.
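A training run on a gold-standard dataset might then look like this (dataset name, base model and output path are placeholders, and you can check prodigy ner.batch-train --help for the exact arguments in your version):

```bash
prodigy ner.batch-train news_gold en_core_web_sm --output ./ner-model --no_missing
```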
