Hello,
Our team is building a model to perform NER.
We would like to utilise spaCy's existing NER models and categories, for example en_core_web_lg, whilst introducing some new categories and excluding some pre-existing ones.
For example...
We would like to keep NER entity types such as GPE, ORG, LOC, and FAC, as we are interested in these and spaCy models already perform well for them.
We are not interested in some existing entity types like LAW, WORK_OF_ART, and PERCENT.
We want to introduce new entity types like FORM_CODE, OCCUPATION, and ROLE to the model's predictions.
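To make the "excluding" part concrete, our current assumption is that unwanted types can simply be dropped after prediction with a small custom component, rather than removed from the model itself. A minimal sketch (the component name and label set below are just illustrative):

```python
import spacy
from spacy.language import Language

# Labels we want to surface: the original spaCy types we care about,
# plus our new custom types.
KEEP_LABELS = {"GPE", "ORG", "LOC", "FAC", "FORM_CODE", "OCCUPATION", "ROLE"}

@Language.component("filter_entity_types")
def filter_entity_types(doc):
    # Drop predictions for the types we are not interested in
    # (LAW, WORK_OF_ART, PERCENT, ...).
    doc.ents = [ent for ent in doc.ents if ent.label_ in KEEP_LABELS]
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("filter_entity_types", after="ner")
```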
We have already annotated several thousand examples of data using Prodigy's ner.correct recipe to bootstrap the annotation process, ensuring that our annotation is consistent with the en_core_web_lg predictions. We annotated to ensure the model (a transformer model) "learns" to identify those entities within the specific linguistic context of our specialised corpus.
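Roughly, our annotation sessions were run along these lines (the dataset name and source file here are placeholders):

```
prodigy ner.correct our_ner_data en_core_web_lg ./corpus.jsonl --label GPE,ORG,LOC,FAC,FORM_CODE,OCCUPATION,ROLE
```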
Surprisingly, what we have noticed is that performance is noticeably poor for some of the original spaCy NER entity types, including GPE, ORG, LOC, and FAC. We therefore hypothesise that we are doing something wrong, and that instead of extending the fine-tuning of an existing spaCy model, we are in effect training a transformer (RoBERTa) model from scratch.
Our questions are as follows:
- Is it possible to extend the fine-tuning of existing spaCy models to detect new entity types whilst still utilising the existing spaCy entity types? (See the config sketch below for what we mean.)
- If it is possible, is it recommended?
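To illustrate what we mean by extending an existing model: our understanding of the spaCy v3 docs is that a trained component can be sourced from an installed pipeline in the training config, along the lines of the snippet below. This is a sketch only; we are not sure how a sourced ner component interacts with a freshly initialised transformer in the same pipeline.

```
[components.ner]
source = "en_core_web_lg"
# Our understanding is that replace_listeners copies the sourced
# tok2vec into the component so it no longer depends on the original
# pipeline's shared tok2vec; we have not verified this in our setup.
replace_listeners = ["model.tok2vec"]
```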
I have attached our config.cfg file to this post, as we believe it may help. The config.cfg file was created using the spacy init command:
```
spacy init config configs/${vars.config} --lang en --pipeline transformer,ner --optimize accuracy --gpu --force
```
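We then train with the standard CLI, roughly as follows (the output and corpus paths here are placeholders):

```
python -m spacy train configs/config.cfg --output ./output --gpu-id 0 --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
```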
Many thanks,
@rory-hurley-gds