New entity model ruins other entities

@Andrey A very simple solution would be to use spaCy: load the model you later want to update and process a bunch of sentences with it. You can then extract the existing entity spans and export them in the same format as your other annotations. Once you're done, mix in your new annotations and train the model on the complete data.

Here's a minimal example and implementation idea:

import spacy

nlp = spacy.load('en_core_web_sm')
examples = []  # save this out later

for doc in nlp.pipe(LIST_OF_YOUR_TEXTS):
    # get all existing entity spans with start, end and label
    spans = [{'start': ent.start_char, 'end': ent.end_char,
              'label': ent.label_} for ent in doc.ents]
    examples.append({'text': doc.text, 'spans': spans})
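If your other annotations are stored as JSONL (one JSON object per line, which is the format Prodigy reads and writes), you could then save the collected examples out like this. This is just a minimal sketch using the standard library, and examples.jsonl is a placeholder file name:

import json

# write one task per line so the file can be used alongside your other annotations
with open('examples.jsonl', 'w', encoding='utf8') as f:
    for eg in examples:
        f.write(json.dumps(eg) + '\n')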

Of course, not all of the predictions are going to be correct, so you likely want to remove the bad ones. You could do this by hand or use Prodigy's mark recipe to just stream in the data and say yes or no to each span. So instead of creating one example with all spans, you could also create one example per span:

for doc in nlp.pipe(LIST_OF_YOUR_TEXTS):
    for ent in doc.ents:
        # one example per entity span, so each one can be accepted or rejected
        span = {'start': ent.start_char, 'end': ent.end_char, 'label': ent.label_}
        examples.append({'text': doc.text, 'spans': [span]})
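If you go with one example per span, you could then stream that file into the mark recipe and accept or reject each span, something along these lines (the dataset name and file name are placeholders, and I'm assuming the ner interface here to render the highlighted span):

prodigy mark ner_review examples.jsonl --view-id ner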

Prodigy's ner.make-gold implements the same idea: you see what the model currently predicts and can make edits and add new annotations. So your final training data will include both the new entities and the old ones that the model previously got correct.
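For example, assuming your texts are in a JSONL file and the label names below are just placeholders for your own label set, the command could look something like this:

prodigy ner.make-gold your_dataset en_core_web_sm your_texts.jsonl --label PERSON,ORG,YOUR_NEW_LABEL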

The training recipes in the latest version of Prodigy now also support a --no-missing flag that lets you specify that all annotations are complete and should be treated as gold standard. While the regular training process assumes that non-annotated tokens are missing values (to allow training from single entity spans and binary decisions), training with the --no-missing flag will treat all other tokens as "O" (outside an entity). So if you know that your training examples cover all entities that are present in the data, this can give you another boost in accuracy.
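For instance, when training with ner.batch-train, the flag is just added to the command. The dataset name and output path below are placeholders:

prodigy ner.batch-train your_dataset en_core_web_sm --output /path/to/model --no-missing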

Finally, you might also find this thread useful, which discusses an approach to mix in examples from the model's original training data (in this case, spaCy's English models):