Extending spaCy models with Prodigy to detect new entity types


Our team is building a model to perform NER.

We would like to utilise spaCy's existing NER models and categories, for example `en_core_web_lg`, whilst introducing some new categories and excluding some pre-existing ones.

For example...
We would like to keep NER entity types such as GPE, ORG, LOC and FAC, as we are interested in these and spaCy models already perform well on them.
We are not interested in some existing entity types like LAW, WORK_OF_ART and PERCENT.
We want to introduce new entity types like FORM_CODE, OCCUPATION and ROLE to the model's predictions.
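To make this concrete, the behaviour we want from the final pipeline looks roughly like the sketch below. This uses plain (text, label) tuples rather than spaCy Span objects, purely for illustration; the KEEP set is just our target label list:

```python
# Sketch: keeping only the entity labels we care about.
# Entities are represented as (text, label) tuples here instead of spaCy
# Span objects, purely for illustration.

KEEP = {"GPE", "ORG", "LOC", "FAC", "FORM_CODE", "OCCUPATION", "ROLE"}

def filter_ents(ents):
    """Keep only entities whose label is in the KEEP set."""
    return [(text, label) for text, label in ents if label in KEEP]

predicted = [
    ("London", "GPE"),
    ("the Copyright Act", "LAW"),        # not of interest -> dropped
    ("HM Revenue & Customs", "ORG"),
    ("SA100", "FORM_CODE"),              # one of our new labels
]

print(filter_ents(predicted))
```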

We have already annotated several thousand examples using Prodigy's ner.correct recipe to bootstrap the annotation process, ensuring that our annotations are consistent with the `en_core_web_lg` predictions. We annotated this way so that the model (a transformer model) "learns" to identify those entities within the specific linguistic context of our specialised corpus.

Surprisingly, we have noticed that performance is not very good for some of the "original spaCy" NER entity types, including GPE, ORG, LOC and FAC. We therefore hypothesise that we are doing something wrong: instead of fine-tuning existing spaCy models, we are currently training transformer (RoBERTa) models from scratch.

Our questions are as follows:

  1. Is it possible to fine-tune existing spaCy models so that they detect new entity types whilst retaining existing spaCy entity types?
  2. If it is possible, is it recommended?
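For what it's worth, our understanding is that spaCy's config system can source a trained component from an existing pipeline rather than initialising it from scratch, along the lines of the sketch below, but we are unsure whether that is appropriate once the label set changes:

```ini
# Sketch: sourcing the pretrained NER component in config.cfg instead of
# initialising it from scratch. Sourcing from a pipeline with a shared
# tok2vec, such as en_core_web_lg, usually also needs replace_listeners.
[components.ner]
source = "en_core_web_lg"
replace_listeners = ["model.tok2vec"]
```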

I have attached our config.cfg file to this post, as we believe it may help. The config.cfg file was created with the spacy init command:

```shell
spacy init config configs/${vars.config} --lang en --pipeline transformer,ner --optimize accuracy --gpu --force
```

Many thanks,

Hi @rory-hurley-gds!

Thanks for the background and details on the experiment.

I noticed you also posted this over on the spaCy discussions.

I would agree with @rmitsch's point:

we strongly recommend training your own NER model from scratch if your target labels change. One reason for that is fine-tuning may lead to catastrophic forgetting of previously learned labels. It's possible to use spaCy's rehearsal functionality to improve model stability, but retraining from scratch is the go-to approach for this as of now.
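To make the rehearsal idea a bit more concrete: one common mitigation (sometimes called pseudo-rehearsal) is to mix "silver" annotations, produced by the original model on unlabelled text, into your training data, so the old labels keep appearing alongside your gold annotations for the new ones. A rough sketch of the data mixing, with all example data hypothetical:

```python
import random

# Sketch of pseudo-rehearsal data mixing: blend gold annotations for the
# new labels with "silver" annotations for the old labels produced by the
# original model. Examples are (text, entity offsets) pairs; the data
# below is made up for illustration.

def mix_training_data(gold_examples, silver_examples, silver_ratio=0.5, seed=0):
    """Return a shuffled training set with roughly `silver_ratio` silver
    examples per gold example."""
    rng = random.Random(seed)
    n_silver = int(len(gold_examples) * silver_ratio)
    mixed = list(gold_examples) + rng.sample(silver_examples, n_silver)
    rng.shuffle(mixed)
    return mixed

gold = [("Fill in form SA100", [(13, 18, "FORM_CODE")])] * 4
silver = [("She works in London", [(13, 19, "GPE")])] * 10

train_set = mix_training_data(gold, silver)
print(len(train_set))  # 4 gold + 2 silver = 6
```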

Have you seen our NER flowchart? It reviews a similar circumstance of deciding whether to train new entity types. A few months ago, we updated the flowchart with links to relevant posts and documentation.

For example, in that flowchart we recommend that if you're trying to add more than three new entities, you're better off training from scratch.

As Raphael mentioned, you may be dealing with catastrophic forgetting.

I like that post because it offers several ideas for how you can overcome it.

If you do want to fine-tune, I'd recommend this post from Matt:


Hope this helps!