How does NER labeling avoid missing labels in the database


There is an example on the prodigy website (https://prodi.gy/docs/) that reads “Airbnb settles lawsuit with San Francisco” where “San Francisco” is labeled as a GPE. I’m assuming the other potential label of Airbnb as an organization is intentionally omitted to give a simple binary decision to the annotator and San Francisco was chosen as the higher priority target by the active learning algorithm.

What gets saved to the database in this case? I’d expect that it matters if my training data has (‘Airbnb’, ‘organization’) vs. (‘Airbnb’, ‘other’) for labels when I run batch training after labeling is complete.

Thanks!

The NER training algorithm (and the textcat training algorithm too, actually) supports missing labels. It works like this: the parser first does a beam search to find the K-best parses, and then searches again subject to constraints imposed by the partial annotations. During this constrained search, the parser avoids taking any actions that would lead to annotations we know are incorrect. We also score the parses from the first, unconstrained search, which gives us two sets of parses: one incorrect, and one correct. The weights are then updated so that more probability is assigned to the correct parses than to the incorrect ones.
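To make the idea concrete, here's a toy sketch (not Prodigy's actual internals) of the partitioning step: given a beam of candidate parses and a partial annotation, split the beam into parses that are consistent with what we know and parses that aren't. The span tuples and helper name are invented for illustration.

```python
# Toy sketch: partition K-best parses using partial annotations as
# constraints. A parse is a list of (start, end, label) entity spans;
# the partial annotation only tells us some spans that are definitely
# right (known_good) or definitely wrong (known_bad).

def split_beam(parses, known_good, known_bad):
    """A parse counts as 'correct' if it contains every span we know
    is right and none of the spans we know are wrong."""
    correct, incorrect = [], []
    for parse in parses:
        spans = set(parse)
        if known_good <= spans and not (known_bad & spans):
            correct.append(parse)
        else:
            incorrect.append(parse)
    return correct, incorrect

beam = [
    [(0, 1, "ORG"), (4, 6, "GPE")],  # Airbnb=ORG, San Francisco=GPE
    [(4, 6, "GPE")],                 # only San Francisco labeled
    [(4, 6, "PERSON")],              # wrong label for San Francisco
]
good = {(4, 6, "GPE")}  # the single span the annotator accepted
bad = set()             # nothing explicitly rejected

correct, incorrect = split_beam(beam, good, bad)
# The first two parses are consistent with the partial annotation; the
# third violates it. The update then shifts probability mass from the
# incorrect set toward the correct set -- without ever needing a label
# for "Airbnb".
```

The key point the toy captures: the parse that labels Airbnb as ORG and the parse that leaves it unlabeled are *both* treated as "possibly correct", which is why the missing label doesn't hurt.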

You can read a short description of the latent-variable beam parsing update in my paper here: https://aclanthology.info/pdf/Q/Q14/Q14-1011.pdf (Section 4.2).

This is also how the NER algorithm learns from examples you mark incorrect. When you mark an example incorrect, there are still multiple possible correct entities — but we still have a useful constraint to use in our search.

Thanks for the explanation! I will indeed check out that paper later tonight, it looks like a good tool to have in the belt in general.

Is there a clean way in Prodigy to handle labeling for custom models which aren’t necessarily robust to missing labels? E.g. an NER model with optional relations between entities (for concreteness, let’s say I build the model in PyTorch and wrap it with spaCy)?

If your model doesn’t support missing values, I would recommend using a model to predict the missing values.

Note that there’s an important trick to this. You need to be predicting the values with a model you’re not updating. If the model gets to define its own objective, it’ll settle into a state where the solution is trivial, e.g. it never predicts any entities.
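A minimal sketch of that trick, with invented names: use a frozen model's predictions to fill in any span the annotator didn't label, and train the *other* model against the completed annotations. `frozen_preds` here is a hypothetical stand-in for whatever prediction interface your wrapped PyTorch model exposes; in PyTorch you'd freeze it by setting `requires_grad_(False)` on its parameters.

```python
# Hedged sketch: complete partial annotations using a model that is
# NOT being updated, so it can't collapse to a trivial solution like
# predicting no entities at all.

def complete_annotations(partial, predict_entities):
    """Keep the human labels where they exist; fill the gaps with the
    frozen model's predictions."""
    completed = dict(partial)                  # human labels win
    for span, label in predict_entities():
        completed.setdefault(span, label)      # only fill missing spans
    return completed

# Predictions from the frozen model (held fixed during training):
frozen_preds = lambda: [((0, 1), "ORG"), ((4, 6), "GPE")]

# The annotator only labeled San Francisco:
human = {(4, 6): "GPE"}

gold = complete_annotations(human, frozen_preds)
# `gold` now covers both spans; train the updatable model against it.
```

Because the filling model is frozen, the trainable model can't game the objective by driving the "missing" labels toward whatever it finds easiest to predict.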