I'm creating several new custom entities based on a corpus of approximately 20K sentences. The corpus has a high percentage of several of the entity types (PERSON, ORG, etc.) recognized by en_core_web_lg, but a number of entity types are missing or poorly represented, such as NORP, FAC, PRODUCT, EVENT, and WORK_OF_ART. Prodigy NER training starting from the en_core_web_lg model works great on my new entities and on the well-represented ones in my corpus. Although my corpus contains few examples of several pretrained entity types now, they may appear in future, as-yet-unseen sentences, and I would like to retain spaCy's ability to find them.
How do I get examples of these entities so that spaCy doesn't forget about them? Is there a dataset I can add to my corpus to cover the missing entities? Is there another approach, not covered by the Annotation Flowchart, that I should try?
The en_core_web_lg model has fairly poor recall of FAC, PRODUCT, EVENT and WORK_OF_ART, as those entity types are not well represented in its original training data. The model should predict NORP well, though: that category is common in news text, since it's used for demonyms such as "American", "European", "Iraqi", "British", etc. If you add more news text to your corpus, you should find examples of it.
Preparing a dataset of these under-represented entities is a good idea, but we don't currently have one. If you run the model over more text, you could try to assemble a corpus for them yourself. But if the original model still isn't predicting them often across 1 million sentences or more, I would say the model isn't actually "forgetting" anything if you update it without them; after all, it wasn't predicting them in the first place.
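To make the "run the model over more text" step concrete, here is a minimal sketch of how you might collect "silver" examples of the rare labels in spaCy's offset-based training format. The filtering logic is shown as a pure function so it's easy to inspect; in practice the `(start_char, end_char, label_)` spans would come from running `en_core_web_lg` over your raw text with `nlp.pipe()` (shown in a comment), and `keep_rare` is a hypothetical helper name, not a spaCy or Prodigy API.

```python
# Labels that en_core_web_lg under-predicts in this corpus.
RARE_LABELS = {"NORP", "FAC", "PRODUCT", "EVENT", "WORK_OF_ART"}

def keep_rare(text, ents, labels=RARE_LABELS):
    """Return a (text, {"entities": [...]}) training pair if the sentence
    contains at least one entity with an under-represented label, else None.

    `ents` is a list of (start_char, end_char, label) tuples, e.g. taken
    from a spaCy Doc via [(e.start_char, e.end_char, e.label_) for e in doc.ents].
    """
    spans = [(start, end, label) for start, end, label in ents if label in labels]
    return (text, {"entities": spans}) if spans else None

# In practice (not run here, since it needs the model downloaded):
#
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   silver = []
#   for doc in nlp.pipe(raw_sentences):
#       pair = keep_rare(doc.text,
#                        [(e.start_char, e.end_char, e.label_) for e in doc.ents])
#       if pair:
#           silver.append(pair)
```

Mixing a sample of these silver examples into your Prodigy training data is one common way to reduce forgetting; since they're model predictions rather than gold annotations, it's worth reviewing them (e.g. with a Prodigy review workflow) before training on them.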