How does NER labeling avoid missing labels in the database


There is an example on the prodigy website (https://prodi.gy/docs/) that reads “Airbnb settles lawsuit with San Francisco” where “San Francisco” is labeled as a GPE. I’m assuming the other potential label of Airbnb as an organization is intentionally omitted to give a simple binary decision to the annotator and San Francisco was chosen as the higher priority target by the active learning algorithm.

What gets saved to the database in this case? I’d expect that it matters if my training data has (‘Airbnb’, ‘organization’) vs. (‘Airbnb’, ‘other’) for labels when I run batch training after labeling is complete.

Thanks!

The NER training algorithm (and the textcat training algorithm too, actually) supports missing labels. It works like this: the parser first does a beam search to find the K-best parses, and then searches again subject to constraints imposed by the partial annotations. During this constrained search, the parser avoids taking any actions that would lead to annotations we know are incorrect. We also score the parses from the first, unconstrained search, which gives us two sets of parses: one incorrect, and one correct. The weights are then updated so that more probability is assigned to the correct parses than to the incorrect ones.
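To make the idea concrete, here's a toy sketch (not Prodigy's actual internals) of the partitioning step: given a beam of candidate parses and a partial annotation, split the beam into parses that are consistent with what we know and parses that aren't. The span tuples and helper name are invented for illustration.

```python
# Toy sketch: partition K-best parses using partial annotations as
# constraints. A parse is a list of (start, end, label) entity spans;
# the partial annotation only tells us some spans that are definitely
# right (known_good) or definitely wrong (known_bad).

def split_beam(parses, known_good, known_bad):
    """A parse counts as 'correct' if it contains every span we know
    is right and none of the spans we know are wrong."""
    correct, incorrect = [], []
    for parse in parses:
        spans = set(parse)
        if known_good <= spans and not (known_bad & spans):
            correct.append(parse)
        else:
            incorrect.append(parse)
    return correct, incorrect

beam = [
    [(0, 1, "ORG"), (4, 6, "GPE")],  # Airbnb=ORG, San Francisco=GPE
    [(4, 6, "GPE")],                 # only San Francisco labeled
    [(4, 6, "PERSON")],              # wrong label for San Francisco
]
good = {(4, 6, "GPE")}  # the single span the annotator accepted
bad = set()             # nothing explicitly rejected

correct, incorrect = split_beam(beam, good, bad)
# The first two parses are consistent with the partial annotation; the
# third violates it. The update then shifts probability mass from the
# incorrect set toward the correct set -- without ever needing a label
# for "Airbnb".
```

The key point the toy captures: the parse that labels Airbnb as ORG and the parse that leaves it unlabeled are *both* treated as "possibly correct", which is why the missing label doesn't hurt.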

You can read a short description of the latent-variable beam parsing update in my paper here: https://aclanthology.info/pdf/Q/Q14/Q14-1011.pdf (Section 4.2).

This is also how the NER algorithm learns from examples you mark incorrect. When you mark an example incorrect, there are still multiple possible correct entities — but we still have a useful constraint to use in our search.

Thanks for the explanation! I will indeed check out that paper later tonight, it looks like a good tool to have in the belt in general.

Is there a clean way in Prodigy to handle labeling for custom models which aren’t necessarily robust to missing labels? E.g. an NER model with optional relations between entities (for concreteness, let’s say I build the model in PyTorch and wrap it with spaCy)?

If your model doesn’t support missing values, I would recommend using a model to predict the missing values.

Note that there’s an important trick to this. You need to be predicting the values with a model you’re not updating. If the model gets to define its own objective, it’ll settle into a state where the solution is trivial, e.g. it never predicts any entities.
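A minimal sketch of that trick, with invented names: use a frozen model's predictions to fill in any span the annotator didn't label, and train the *other* model against the completed annotations. `frozen_preds` here is a hypothetical stand-in for whatever prediction interface your wrapped PyTorch model exposes; in PyTorch you'd freeze it by setting `requires_grad_(False)` on its parameters.

```python
# Hedged sketch: complete partial annotations using a model that is
# NOT being updated, so it can't collapse to a trivial solution like
# predicting no entities at all.

def complete_annotations(partial, predict_entities):
    """Keep the human labels where they exist; fill the gaps with the
    frozen model's predictions."""
    completed = dict(partial)                  # human labels win
    for span, label in predict_entities():
        completed.setdefault(span, label)      # only fill missing spans
    return completed

# Predictions from the frozen model (held fixed during training):
frozen_preds = lambda: [((0, 1), "ORG"), ((4, 6), "GPE")]

# The annotator only labeled San Francisco:
human = {(4, 6): "GPE"}

gold = complete_annotations(human, frozen_preds)
# `gold` now covers both spans; train the updatable model against it.
```

Because the filling model is frozen, the trainable model can't game the objective by driving the "missing" labels toward whatever it finds easiest to predict.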