New entity model ruins other entities

I’m trying to train a new entity type, “Technology”.
I started with some seed terms, created patterns and annotated ~300 examples. Then I batch-trained on them with en_core_web_lg as the base model.
Although I don’t have a lot of annotations yet, I wanted to see how the model is doing so far.

It seems that the model mislabels a lot of entities and tends to label non-entities as WORK_OF_ART:

import spacy
nlp = spacy.load('tech-model')
doc = nlp('Blockchain is a kind of technology')
[(ent.text, ent.label_) for ent in doc.ents]
> [('Blockchain', 'TECH')]

doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
[(ent.text, ent.label_) for ent in doc.ents]
> [('Apple', 'LAW'), ('is', 'WORK_OF_ART'), ('looking', 'WORK_OF_ART'), ('at', 'WORK_OF_ART'), ('buying', 'WORK_OF_ART'), ('startup', 'WORK_OF_ART'), ('for', 'WORK_OF_ART'), ('$1 billion', 'MONEY')]

nlp = spacy.load('en_core_web_lg')
doc = nlp('Apple is looking at buying U.K. startup for $1 billion')
[(ent.text, ent.label_) for ent in doc.ents]
> [('Apple', 'ORG'), ('U.K.', 'GPE'), ('$1 billion', 'MONEY')]

This also happened with another entity type I trained.
What am I doing wrong? :)

Thanks a lot!

Sorry about the late reply! I think what you’re experiencing might be what’s often referred to as the “catastrophic forgetting problem”. As your model is learning about the new entity type, it’s “forgetting” what it has previously learned. In your example, this is pretty significant – but it might be because you’ve trained a completely new entity, so the only data the model is updating on is examples labelled TECH and none of the other entity types. Because the model is never “reminded” about the other types, it overfits on the new data.

This blog post we’ve published has some more background on this, including strategies to prevent it. One approach is to mix in examples that the model previously got right and train on both those examples and the new examples.

This is pretty easy to do in Prodigy – after collecting annotations for your new TECH entity, run the model on the same input text and annotate the other labels. You can add all annotations to the same dataset, and then train your model with those examples. Make sure to always use input data that’s similar to what the model will have to process at runtime. This might also give you a little boost in accuracy over the standard English model, because you’re also improving the existing entity types on your specific data.
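
For example, the whole workflow could look roughly like this (the dataset and file names are just placeholders, and the exact recipe arguments depend on your Prodigy version):

# collect annotations for the new TECH label
prodigy ner.teach tech_ner en_core_web_lg your_texts.jsonl --label TECH --patterns tech_patterns.jsonl
# annotate the existing labels on the same input text, into the same dataset
prodigy ner.teach tech_ner en_core_web_lg your_texts.jsonl --label ORG,GPE,MONEY
# train on the combined dataset
prodigy ner.batch-train tech_ner en_core_web_lg --output /path/to/tech-model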

Alternatively, you can also generate those examples yourself, by running spaCy over a bunch of text and selecting the entities you care about the most. (See the PRODIGY_README.html for details on the JSONL format – all you need to do is convert the entity annotations to this format and then import them to your dataset using the db-in command.)
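
For example, each line of the JSONL file could look roughly like this (the text and spans below are made up for illustration; see the README for the exact fields):

{"text": "Apple is looking at buying U.K. startup", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"}

You can then import the file with something like prodigy db-in your_dataset predicted_entities.jsonl.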

After reading your reply and this blog post I totally agree that this is probably a “catastrophic forgetting problem” case.
Thanks for taking the time to write a detailed answer!

I hit a similar catastrophic forgetting problem that also ended up identifying pretty much every word as a WORK_OF_ART. It seems odd that the model would consistently gravitate towards that particular label.

@McNeill Wild thought here… Perhaps WORK_OF_ART is the least accurately trained label and sits closest to O. Since new entities can take a piece of the weights away from O, WORK_OF_ART becomes a kind of catch-all bucket when the scores of the other entities (old and new) are low.

Here’s a story I’ve made up that helps me understand catastrophic forgetting.

You and another person are standing in a room with a long table. The person says to you, “Go stand at the right-hand end of the table” so you go stand at the right-hand end of the table. Then they say to you “Go stand at the left-hand end of the table” so you go stand at the left-hand end of the table, and for some reason the other person looks unhappy.

“You told me to go stand at the left-hand end of the table and I did,” you say. “What’s the problem?”

“Well I know I told you to go stand at the left-hand end of the table,” they reply, “but I was hoping you’d end up somewhere more in the middle. It’s like you forgot the whole standing at the right-hand end part you did first.”

“I didn’t forget anything,” you reply. “I just did what you asked me to do.”

I'm facing exactly the same 'catastrophic forgetting' problem: after training a model with a new entity type, all the other entities are overridden. I understand the process described by Ines, but I'm a bit confused about how to implement this part:

run the model on the same input text, and annotate the other labels.

Any chance to help with a couple of lines to make it clear? Thanks!

@Andrey A very simple solution would be to use spaCy: load the model you want to update later and process a bunch of sentences with it. You can then extract the existing entity spans and export them in the same format as your other annotations. Once you're done, mix in your new annotations and train the model on the complete data.

Here's a minimal example and implementation idea:

import spacy

nlp = spacy.load('en_core_web_sm')
examples = []  # save this out later

for doc in nlp.pipe(LIST_OF_YOUR_TEXTS):
    # get all existing entity spans with start, end and label
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_} for ent in doc.ents]
    examples.append({'text': doc.text, 'spans': spans})
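
To export those examples later (the file name here is just an example), writing them out as JSONL with the standard library is enough:

import json

with open('existing_annotations.jsonl', 'w', encoding='utf8') as f:
    for eg in examples:
        f.write(json.dumps(eg) + '\n')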

Of course, not all of the predictions are going to be correct, so you likely want to remove the bad ones. You could do this by hand or use Prodigy's mark recipe to just stream in the data and say yes or no to each span. So instead of creating one example with all spans, you could also create one example per span:

for doc in nlp.pipe(LIST_OF_YOUR_TEXTS):
    for ent in doc.ents:
        span = {'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
        examples.append({'text': doc.text, 'spans': [span]})

Prodigy's ner.make-gold implements the same idea: you get to see what the model currently predicts and you get to make edits and add new annotations. So your final training data will include both the new entities, as well as the old ones that the model previously got correct.
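
For example, something along these lines (the dataset name, source file and label set are placeholders):

prodigy ner.make-gold gold_ner en_core_web_lg your_texts.jsonl --label TECH,ORG,GPE,MONEY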

The training recipes in the latest version of Prodigy now also support a --no-missing flag that lets you specify that all annotations are complete and should be treated as gold standard. While the regular training process assumes that non-annotated tokens are missing values (to allow training from single entity spans and binary decisions), training with the --no-missing flag will treat all other tokens as O (outside an entity). So if you know that your training examples cover all entities that are present in the data, this can give you another boost in accuracy.
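
Depending on your version, that could look roughly like this (names are placeholders again):

prodigy ner.batch-train gold_ner en_core_web_lg --output /path/to/tech-model --no-missing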

Finally, you might also find the thread on mixing in examples from the model's original training data (in this case, spaCy's English models) useful.

Hi Ines,

Fantastic! I’m now reading through the forum and trying various things. Thanks again for your detailed answer!