Which base_model to use for ner.manual

Hi @Fangjian,

I understand you want to train an NER model to recognize custom entities, i.e. entities that are not covered by en_core_scibert? In that case, it is indeed better to train your NER model from scratch. Trying to add new categories to an already trained model might result in unpredictable behavior, as it's hard to control how the new data affects the existing weights, especially since the pretrained categories were probably trained on a much bigger dataset. In this post Matt explains the dynamics of resuming the training of a NER component.

So in your case, you either want to use a version of en_core_scibert without the NER component as the base model for training or you want to substitute the en_core_scibert NER component with your custom NER component. You can check this example spaCy project to see how it can be done.
I'm not sure if there are other components of en_core_scibert that depend on NER predictions. If that's the case, you might need to remove them as well. Alternatively, you can leave everything as is, give your NER component a different name and freeze the en_core_scibert components during training. You can read a bit more about customizing spaCy pipelines here.
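To make the "different name" option a bit more concrete, here is a minimal sketch. It uses a blank English pipeline as a stand-in so it runs without the scientific model installed; in your setup you would load en_core_scibert instead. The component name `custom_ner` and the label `MY_LABEL` are made up for illustration.

```python
import spacy

# Stand-in for the pretrained pipeline (assumption: in practice you would
# load your installed scientific model here instead of a blank pipeline).
nlp = spacy.blank("en")
nlp.add_pipe("ner")  # plays the role of the pretrained NER component

# Add a second NER component under a different (hypothetical) name.
nlp.add_pipe("ner", name="custom_ner")
custom_ner = nlp.get_pipe("custom_ner")
custom_ner.add_label("MY_LABEL")  # hypothetical custom label

# Everything except the new component is a candidate for freezing,
# e.g. via `frozen_components` in the [training] block of the config.
frozen = [name for name in nlp.pipe_names if name != "custom_ner"]
print(nlp.pipe_names, frozen)
```

Note that two NER components writing to `doc.ents` can overwrite each other's predictions at runtime, so it's worth checking the output of the combined pipeline before relying on it.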

Even though you'd be training your NER component from scratch, you can still benefit from the en_core_scibert tokenizer and its word vectors and/or pretrained embeddings, both for training and for data annotation with ner.manual. In other words, you can benefit from transfer learning at the representation level, independent of the pretrained NER component (which you'll discard). In fact, it's very important that the same tokenizer is used during data annotation, model training and later in production. So, in summary, your workflow should look like this:

  1. annotate using ner.manual and en_core_scibert as the base model to benefit from the scientific tokenizer (I'm assuming the labels you'd be using will be different from the en_core_scibert pretrained ones)
  2. try saving to disk a version of en_core_scibert without the NER component and use this model as the base model for training OR write a spaCy training config where you substitute the pretrained NER with your custom NER (as in this example project)
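For the first option in step 2, a rough sketch could look like the one below. Again, a blank pipeline with a `sentencizer` and an untrained `ner` stands in for the real model so the snippet runs on its own; the helper `save_without_ner` and the output directory name are hypothetical. With the real model installed, `spacy.load(name, exclude=["ner"])` should get you to the same place without the explicit removal.

```python
import tempfile
from pathlib import Path

import spacy


def save_without_ner(nlp, output_dir):
    """Remove the "ner" component (if present) and save the pipeline."""
    if "ner" in nlp.pipe_names:
        nlp.remove_pipe("ner")
    nlp.to_disk(output_dir)
    return Path(output_dir)


# Stand-in for the pretrained pipeline (assumption: replace this with
# loading your installed scientific model).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("ner")

with tempfile.TemporaryDirectory() as tmp:
    path = save_without_ner(nlp, Path(tmp) / "scibert_no_ner")
    # The saved directory can now be used as a base model for training.
    reloaded = spacy.load(path)
    reloaded_pipes = reloaded.pipe_names

print(reloaded_pipes)
```

The tokenizer travels with the saved pipeline, so annotating with the original model and training from the stripped copy keeps the tokenization consistent.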