Which base_model to use for ner.manual

magdaaniol · April 28, 2025, 9:20am

I understand you want to train an NER model to recognize custom entities, i.e. entities that are not covered by en_core_scibert? In that case, it is indeed better to train your NER model from scratch. Adding new categories to an already trained model can lead to unpredictable behavior, because it's hard to control how the new data affects the existing weights, especially since the pretrained categories were probably trained on a much bigger dataset. In this post Matt explains the dynamics of resuming the training of an NER component.

So in your case, you want to replace the en_core_scibert NER component with your custom (blank) NER component. You can check this example spaCy project to see how it can be done.
I'm not sure whether other components of en_core_scibert depend on the NER predictions. If they do, you might need to remove them as well. Alternatively, you can leave everything as is, give your NER component a different name, and freeze the existing en_core_scibert components during training. You can read a bit more about customizing spaCy pipelines here.
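As a rough illustration of the replacement described above, the relevant parts of a spaCy v3 training config could look something like the fragment below. This is only a sketch, not a complete config: the component name "transformer" is an assumption about what the embedding layer in en_core_scibert is called (check `nlp.pipe_names` on the loaded pipeline), and the wiring of the new NER model to that shared layer is left out.

```ini
# Sketch of selected config sections only; the remaining sections come from
# your usual base config.

[nlp]
lang = "en"
# Assumed component names: adjust to what the base pipeline actually contains.
pipeline = ["transformer", "ner"]

# Reuse the pretrained embedding layer from the base pipeline.
[components.transformer]
source = "en_core_scibert"

# A fresh NER component created from the factory, i.e. trained from scratch.
# With default settings it builds its own embedding layer; connecting it to
# the sourced layer requires a listener in [components.ner.model], not shown.
[components.ner]
factory = "ner"

[training]
# Sourced components listed here are not updated during training.
frozen_components = ["transformer"]

# Copy the tokenizer and vocab from the base pipeline so that training uses
# the same tokenization as annotation.
[initialize.before_init]
@callbacks = "spacy.copy_from_base_model.v1"
tokenizer = "en_core_scibert"
vocab = "en_core_scibert"
```

If you go for the second option instead (keeping the pretrained NER), the new component would get its own section name, e.g. a hypothetical `[components.custom_ner]` with `factory = "ner"`, and the sourced components you want to leave untouched would all be listed in `frozen_components`.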

Even though you'd be training your NER component from scratch, you can still benefit from the en_core_scibert tokenizer and its word vectors and/or pretrained embeddings, both for training and for data annotation with ner.manual. In other words, you get transfer learning at the representation level, independent of the pretrained NER component (which you'll discard). In fact, it's very important that the same tokenizer is used during data annotation, model training and later in production. So, in summary, your workflow should look like this:

1. Annotate using ner.manual with en_core_scibert as the base model to benefit from the scientific tokenizer (I'm assuming the labels you'd be using will be different from the en_core_scibert pretrained ones).
2. Write a spaCy training config where you substitute the pretrained NER with your custom NER (as in this example project).
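To see why using the same tokenizer for annotation and training matters, here is a small self-contained sketch. It does not use spaCy or en_core_scibert at all: the two regex tokenizers and the example sentence are made up purely for illustration. The point it tries to show is that an entity span annotated on one tokenization may not line up with token boundaries under another.

```python
import re


def whitespace_tokens(text):
    """Toy tokenizer A: split on whitespace only. Returns (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def punct_aware_tokens(text):
    """Toy tokenizer B: additionally split off punctuation characters."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_aligns(span, token_offsets):
    """True if the span starts at a token start and ends at a token end."""
    start, end = span
    starts = {s for s, _ in token_offsets}
    ends = {e for _, e in token_offsets}
    return start in starts and end in ends


text = "Treatment with interleukin-2 (IL-2) was effective."

# Suppose the annotator highlighted "IL-2" while working with tokenizer B.
start = text.index("IL-2")
span = (start, start + len("IL-2"))

# Under tokenizer B the span sits on token boundaries.
print(span_aligns(span, punct_aware_tokens(text)))  # True

# Under tokenizer A the token is "(IL-2)", so the same span cuts into a token.
print(span_aligns(span, whitespace_tokens(text)))  # False
```

In spaCy, a character span that does not match token boundaries cannot be turned into an entity span by default (`Doc.char_span` returns `None` in that case), so annotations made with a different tokenizer can end up unusable for training.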
