I'm creating several new custom entities based on a corpus of approximately 20K sentences. The corpus has a high percentage of several of the entity types (PERSON, ORG, etc.) recognized by en_core_web_lg, but a number of entity types are missing or poorly represented, such as NORP, FAC, PRODUCT, EVENT, and WORK_OF_ART. Prodigy NER training starting from the en_core_web_lg model works great on my new entities and on the well-represented ones in my corpus. Although my corpus contains few examples of several pretrained entity types now, they may appear in future, as-yet-unseen sentences, and I would like to retain spaCy's ability to find them.
How do I get examples of these entities so that spaCy doesn't forget about them? Is there a dataset I can add to my corpus to cover the missing entities? Is there another approach, not covered by the Annotation Flowchart, that I should try?
The en_core_web_lg model has fairly poor recall of FAC, PRODUCT, EVENT and WORK_OF_ART, as those entity types are not well represented in its original training data. The model should predict NORP well, though: that category is common in news text, since it's used for demonyms such as "American", "European", "Iraqi", "British", etc. If you add more news text to your corpus, you should find examples of it.
Preparing a dataset of these under-represented entities is a good idea, but we don't currently have one. If you run the model over more text, you could try to assemble a corpus for them yourself. But if the original model still isn't predicting them often across 1 million sentences or more, I would say the model isn't actually "forgetting" anything if you update it without them; after all, it wasn't predicting them in the first place.
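To make the "run the model over more text" step concrete, here is a minimal sketch of how you might collect "silver" examples of the rare labels in spaCy's offset-based training format. The filtering logic is shown as a pure function so it's easy to inspect; in practice the `(start_char, end_char, label_)` spans would come from running `en_core_web_lg` over your raw text with `nlp.pipe()` (shown in a comment), and `keep_rare` is a hypothetical helper name, not a spaCy or Prodigy API.

```python
# Labels that en_core_web_lg under-predicts in this corpus.
RARE_LABELS = {"NORP", "FAC", "PRODUCT", "EVENT", "WORK_OF_ART"}

def keep_rare(text, ents, labels=RARE_LABELS):
    """Return a (text, {"entities": [...]}) training pair if the sentence
    contains at least one entity with an under-represented label, else None.

    `ents` is a list of (start_char, end_char, label) tuples, e.g. taken
    from a spaCy Doc via [(e.start_char, e.end_char, e.label_) for e in doc.ents].
    """
    spans = [(start, end, label) for start, end, label in ents if label in labels]
    return (text, {"entities": spans}) if spans else None

# In practice (not run here, since it needs the model downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   silver = []
#   for doc in nlp.pipe(raw_sentences):
#       pair = keep_rare(doc.text,
#                        [(e.start_char, e.end_char, e.label_) for e in doc.ents])
#       if pair:
#           silver.append(pair)
```

Mixing a sample of these silver examples into your Prodigy training data is one common way to reduce forgetting; since they're model predictions rather than gold annotations, it's worth reviewing them (e.g. with a Prodigy review workflow) before training on them.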