When I fix the catastrophic forgetting problem by adding in entities detected by the baseline model do I have to be careful not to have the new entity spans and the old entity spans overlap?
For example, say I am trying to build an NER model that finds sports teams. I have the following sentence.
The Florida Gators won their away game in California last night.
Out of the box, spaCy will annotate “Florida” and “California” as GPEs. What I ultimately want is to keep “California” as a GPE, but label “Florida Gators” as a SPORTS_TEAM.
In my training data I’ll label “Florida Gators” as SPORTS_TEAM, but then in order to combat catastrophic forgetting I’ll run the sentence through the baseline NER, have it tell me that “California” is a GPE, and add that span to my training data. The baseline NER will also tell me that “Florida” is a GPE, and I don’t want to have that information overwrite my SPORTS_TEAM annotation.
Is there some convention that spaCy/Prodigy uses to keep this straight, or do I just have to be careful not to overlap spans when I'm augmenting my training data?
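Concretely, something like this is what I mean by adding the baseline spans to my training data; the merge helper and the offsets are just my own sketch of the idea:

import spacy

nlp_baseline = spacy.load('en')  # the unmodified baseline model

def augment_example(text, my_spans):
    # my_spans: e.g. [(4, 18, 'SPORTS_TEAM')] for "Florida Gators"
    doc = nlp_baseline(text)
    spans = list(my_spans)
    for ent in doc.ents:
        # Skip baseline entities that overlap one of my hand-labelled spans.
        overlaps = any(ent.start_char < end and ent.end_char > start
                       for start, end, _ in my_spans)
        if not overlaps:
            spans.append((ent.start_char, ent.end_char, ent.label_))
    return (text, {'entities': sorted(spans)})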
The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could include the same sentence twice with the different annotations in your data. I'm not sure whether this would hurt or help your performance, though.
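In spaCy v2's simple training format, that would just mean two entries with the same text, roughly (the offsets are for the example sentence above):

TRAIN_DATA = [
    # One copy with the custom SPORTS_TEAM annotation...
    ("The Florida Gators won their away game in California last night.",
     {"entities": [(4, 18, "SPORTS_TEAM")]}),
    # ...and one copy with the baseline GPE annotations.
    ("The Florida Gators won their away game in California last night.",
     {"entities": [(4, 11, "GPE"), (42, 52, "GPE")]}),
]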
If you want spaCy to learn to recover both annotations, you could have two EntityRecognizer instances in the pipeline. You would need to move the entity annotations into an extension attribute, because you don’t want the second entity recogniser to overwrite the entities set by the first one. Something like this should work:
import spacy
from spacy.tokens import Doc

Doc.set_extension('my_ents', default=None)

def move_ents_to_attr(doc):
    # Stash the entities predicted so far in the extension attribute,
    # then clear doc.ents so the next entity recognizer can set its own.
    if doc._.my_ents is None:
        doc._.my_ents = []
    doc._.my_ents.extend(doc.ents)
    doc.ents = []
    return doc

nlp = spacy.load('en')
nlp.add_pipe(move_ents_to_attr, after='ner')
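After this component runs, the first recognizer's entities live in doc._.my_ents and doc.ents is cleared again, so a second EntityRecognizer added later in the pipeline can set its own predictions without clobbering them. A quick check, assuming the pipeline above:

doc = nlp("The Florida Gators won their away game in California last night.")
print(doc._.my_ents)  # spans moved out of doc.ents by the component
print(doc.ents)       # empty until a second entity recognizer runs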
What if I have more than two NER models? Do I need to create an extension attribute for each of them? Or should I code it differently to handle an unknown number of models in my NER pipeline?
You would need to have an extension attribute to hold the spans, yes. Internally spaCy encodes the entity annotations using IOB-style data, so there's no way to represent overlapping entities on the built-in token data.
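One way to handle an arbitrary number of recognizers is a single list-valued extension that a small "collector" component appends to after each one; the factory and component names here are just my own sketch:

from spacy.tokens import Doc

# One shared attribute; every collector appends to it.
Doc.set_extension('all_ents', default=None)

def make_collector(source_name):
    # Returns a component that stashes the current doc.ents under
    # source_name and clears them for the next entity recognizer.
    def collect_ents(doc):
        if doc._.all_ents is None:
            doc._.all_ents = []
        doc._.all_ents.extend((source_name, ent) for ent in doc.ents)
        doc.ents = []
        return doc
    return collect_ents

# Hypothetical usage, one collector after each recognizer:
# nlp.add_pipe(make_collector('baseline'), after='ner')
# nlp.add_pipe(make_collector('sports'), after='sports_ner')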
Unlike in doc.ents, overlapping matches are allowed in doc.spans, so no filtering is required, but optional filtering and sorting can be applied to the spans before they're saved.
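For example, doc.spans can hold the GPE and the SPORTS_TEAM readings of the same tokens side by side; the 'sc' key below is just an example name:

import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')
doc = nlp("The Florida Gators won their away game in California last night.")

# Overlapping spans are fine in doc.spans, unlike in doc.ents.
doc.spans['sc'] = [
    Span(doc, 1, 3, label='SPORTS_TEAM'),  # "Florida Gators"
    Span(doc, 1, 2, label='GPE'),          # "Florida", overlaps the span above
    Span(doc, 8, 9, label='GPE'),          # "California"
]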
Can I use this to create training data for my spaCy model? If so, how? Because DocBin doesn't accept overlapping spans.
This note in the Prodigy span categorization docs contains just the info you need, I think.
You should consider training a spaCy SpanCategorizer.
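If it helps, here is a minimal sketch of what such training data could look like, with the (possibly overlapping) gold spans stored in doc.spans rather than doc.ents; the 'sc' key matches the spancat component's default spans_key, and the output path is a placeholder:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank('en')
doc = nlp("The Florida Gators won their away game in California last night.")

# Gold spans go into doc.spans; DocBin serializes them even when they overlap.
doc.spans['sc'] = [
    doc.char_span(4, 18, label='SPORTS_TEAM'),
    doc.char_span(4, 11, label='GPE'),
    doc.char_span(42, 52, label='GPE'),
]

doc_bin = DocBin(docs=[doc])
doc_bin.to_disk('./train.spacy')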
Did data-to-spacy with a --spancat dataset not work for you?