When I fix the catastrophic forgetting problem by adding entities detected by the baseline model, do I have to make sure the new entity spans and the old entity spans don't overlap?
For example, say I am trying to build an NER model that finds sports teams. I have the following sentence:
The Florida Gators won their away game in California last night.
Out of the box, spaCy will annotate “Florida” and “California” as GPEs. What I ultimately want is to keep “California” as a GPE, but label “Florida Gators” as a SPORTS_TEAM.
In my training data I’ll label “Florida Gators” as SPORTS_TEAM, but then in order to combat catastrophic forgetting I’ll run the sentence through the baseline NER, have it tell me that “California” is a GPE, and add that span to my training data. The baseline NER will also tell me that “Florida” is a GPE, and I don’t want to have that information overwrite my SPORTS_TEAM annotation.
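To make the question concrete, here is a minimal sketch of the merge I have in mind, written in plain Python over character offsets. The helper names `merge_annotations` and `span` are made up for illustration, and the baseline predictions are hard-coded stand-ins rather than real model output. The idea is to keep every hand-labeled span and add a baseline span only if it overlaps none of them.

```python
text = "The Florida Gators won their away game in California last night."


def span(phrase, label):
    """Build a (start, end, label) character-offset tuple for a phrase in `text`."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def merge_annotations(gold, predicted):
    """Keep every gold span; add a predicted span only if it overlaps no gold span."""
    merged = list(gold)
    for start, end, label in predicted:
        # Two half-open ranges [start, end) overlap when each begins before the other ends.
        overlaps = any(start < g_end and g_start < end for g_start, g_end, _ in gold)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


# My manual annotation.
gold = [span("Florida Gators", "SPORTS_TEAM")]

# Assumed baseline output, hard-coded here instead of calling a real model.
baseline = [span("Florida", "GPE"), span("California", "GPE")]

merged = merge_annotations(gold, baseline)
for start, end, label in merged:
    print(text[start:end], label)
```

With this approach the baseline's "Florida" GPE is dropped because it falls inside my "Florida Gators" span, while "California" survives because it overlaps nothing I labeled.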
Is there some convention that spaCy/Prodigy uses to keep this straight, or do I just have to be careful not to overlap spans when I’m augmenting my training data?