When I fix the catastrophic forgetting problem by adding in entities detected by the baseline model, do I have to be careful that the new entity spans and the old entity spans don't overlap?
For example, say I am trying to build an NER model that finds sports teams. I have the following sentence.
The Florida Gators won their away game in California last night.
Out of the box, spaCy will annotate “Florida” and “California” as GPEs. What I ultimately want is to keep “California” as a GPE, but label “Florida Gators” as a SPORTS_TEAM.
In my training data I’ll label “Florida Gators” as SPORTS_TEAM, but then in order to combat catastrophic forgetting I’ll run the sentence through the baseline NER, have it tell me that “California” is a GPE, and add that span to my training data. The baseline NER will also tell me that “Florida” is a GPE, and I don’t want to have that information overwrite my SPORTS_TEAM annotation.
Is there some convention that spaCy/Prodigy uses to keep this straight, or do I just have to be careful not to overlap spans when I’m augmenting my training data?
The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could include the sentence twice in your data, once with each annotation. I’m not sure whether this would hurt or help your performance, though.
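One way to keep the augmented data within that constraint is to drop any baseline span that overlaps a manual annotation before merging. The following is only a minimal sketch in plain Python, assuming spans are `(start_char, end_char, label)` tuples; the helper names `overlaps` and `merge_without_overlap` are hypothetical and not part of spaCy or Prodigy.

```python
def overlaps(a, b):
    """Return True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_without_overlap(manual_spans, baseline_spans):
    """Keep every manual span; add a baseline span only if it overlaps nothing kept so far."""
    merged = list(manual_spans)
    for span in baseline_spans:
        if not any(overlaps(span, kept) for kept in merged):
            merged.append(span)
    return sorted(merged)


text = "The Florida Gators won their away game in California last night."
manual = [(4, 18, "SPORTS_TEAM")]             # "Florida Gators" (hand-labelled)
baseline = [(4, 11, "GPE"), (42, 52, "GPE")]  # "Florida", "California" (baseline model)

merged = merge_without_overlap(manual, baseline)
for start, end, label in merged:
    print(text[start:end], label)
```

Here the manual annotation always wins, so the baseline "Florida" GPE is discarded while "California" is kept.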
If you want spaCy to learn to recover both annotations, you could have two EntityRecognizer instances in the pipeline. You would need to move the entity annotations into an extension attribute, because you don’t want the second entity recognizer to overwrite the entities set by the first one. Something like this should work:
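import spacy
from spacy.tokens import Doc
Doc.set_extension('my_ents', default=None)  # holds entities from all recognizers
def move_ents(doc):  # function name is illustrative
    if doc._.my_ents is None:
        doc._.my_ents = []
    doc._.my_ents.extend(doc.ents)  # keep the entities found so far
    doc.ents = []  # clear them so the next recognizer starts fresh
    return doc
nlp = spacy.load('en')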
What if I have more than two NER models? Do I need to create an extension attribute for each of them? Or should I code it differently to handle unknown number of models in my NER pipeline?
You would need to have an extension attribute to hold the spans, yes. Internally spaCy encodes the entity annotations using IOB-style data, so there's no way to represent overlapping entities on the built-in token data.
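To illustrate why the built-in token data cannot hold overlapping entities, here is a rough sketch of IOB-style encoding in plain Python. It is not spaCy's internal implementation; the function `spans_to_iob` and the token-index span format are assumptions made for the example. Each token gets exactly one tag, so a second span covering an already-tagged token has nowhere to go.

```python
def spans_to_iob(num_tokens, spans):
    """Encode (start_token, end_token, label) spans as one IOB tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        # Every token has a single tag slot, so an overlap cannot be stored.
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"span {(start, end, label)} overlaps an earlier span")
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["The", "Florida", "Gators", "won"]
tags = spans_to_iob(len(tokens), [(1, 3, "SPORTS_TEAM")])
print(list(zip(tokens, tags)))

# Trying to tag "Florida" as a GPE as well fails, because its slot is taken.
try:
    spans_to_iob(len(tokens), [(1, 3, "SPORTS_TEAM"), (1, 2, "GPE")])
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
```

This is why a separate container, such as an extension attribute, is needed once more than one set of spans has to be kept per document.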
I tried to implement the code above but got the following error:
AttributeError: 'spacy.pipeline.pipes.EntityRecognizer' object has no attribute 'postprocess'
I can't seem to find anything on postprocess in the spaCy docs.
@mkallen My mistake, sorry. Just add the function to the pipeline with