Hi - I’m wondering how lenient I should be with the ner.teach scoring. As an example, for an organization found in the text: “Biocarbon Amalgamate LLC”, if just one or two words are tagged as ORG, should I accept this, reject it, or pass on it and hold out for a better guess later? Another example I am seeing is partial place names, such as just the “Hong” in “Hong Kong” being tagged as GPE… I am also using manual tagging to improve the results but am never sure how to handle these “partial” hits when they come up in ner.teach. Thanks!
If you’re using an active learning-powered recipe, those entities are both examples of suggestions you should reject. By rejecting incorrect boundaries, you’re essentially telling the model “Nope, try again!”, moving it towards the correct boundaries. Each token can only be part of one entity, so if you accepted a partial match like “Hong”, the feedback the model would get from this is “Yep, in contexts like this, ‘Hong’ is a single-token GPE entity and wins over all other possible analyses containing this token!” That’s obviously not what you want.
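To make the rule concrete, here is a minimal sketch of the decision you are making by hand for each suggestion. It is not part of Prodigy: the function name `judge_suggestion` and the `(start, end, label)` tuple layout are assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical span layout for this sketch: (start_char, end_char, label)
Span = Tuple[int, int, str]


def judge_suggestion(suggested: Span, true_entities: List[Span]) -> str:
    """Accept only if boundaries AND label match a true entity exactly."""
    return "accept" if suggested in true_entities else "reject"


text = "The firm moved its office to Hong Kong last year."
start = text.index("Hong Kong")

# What a human annotator knows to be correct for this sentence.
true_entities = [(start, start + len("Hong Kong"), "GPE")]

partial = (start, start + len("Hong"), "GPE")   # only "Hong"
full = (start, start + len("Hong Kong"), "GPE")  # "Hong Kong"

print(judge_suggestion(partial, true_entities))  # reject
print(judge_suggestion(full, true_entities))     # accept
```

The partial span is rejected even though the label is right, because the boundaries are wrong.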
If you’re labeling manually (e.g. using ner.manual or ner.make-gold), your focus would be slightly different: The dataset you produce and use for training later on should reflect the gold-standard analysis with required labels and no missing or unknown values. It’s totally fine to do this in several steps btw – in fact, we usually recommend focusing on a smaller label set when you label manually and making several passes over the data if necessary.
Prodigy is able to train from both types of annotations: accept/reject feedback on single entities where the entity labels of the rest of the text are unknown, and gold-standard annotations that describe the complete text and all available entities (or the fact that the text has no entities). The --no_missing flag on ner.batch-train lets you tell Prodigy that no entities are missing, and that your data should be treated as gold standard.
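As a rough illustration of the difference between the two kinds of annotations, the sketch below shows how the same accepted span can be read either way. The task dict uses "text", "spans" and "answer" keys in the style of Prodigy's annotation tasks, but the helper `token_view`, the whitespace tokenization, and the "?" / "O" markers are assumptions for this example only, not how Prodigy represents things internally.

```python
from typing import Dict, List, Tuple


def token_view(task: Dict, no_missing: bool) -> List[Tuple[str, str]]:
    """Label each whitespace token: entity label, "O" (no entity) or "?" (unknown).

    Simplification: only spans of accepted tasks are applied; rejected
    suggestions are ignored in this view.
    """
    text = task["text"]
    spans = task.get("spans", []) if task.get("answer") == "accept" else []
    outside = "O" if no_missing else "?"
    result = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        pos = end
        label = outside
        for span in spans:
            if start >= span["start"] and end <= span["end"]:
                label = span["label"]
        result.append((word, label))
    return result


text = "Flights to Hong Kong resumed."
start = text.index("Hong Kong")
task = {
    "text": text,
    "spans": [{"start": start, "end": start + len("Hong Kong"), "label": "GPE"}],
    "answer": "accept",
}

# Binary feedback: everything outside the accepted span is unknown.
print(token_view(task, no_missing=False))
# Gold standard: everything outside the span is known to be "no entity".
print(token_view(task, no_missing=True))
```

In the first view, the words around “Hong Kong” could still turn out to be entities; in the second, you are asserting that they are not.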
Thanks, Ines – I appreciate the response and the fact that you went beyond just answering the question and into Best Practices. Warming up my clicking finger for a session now.