Hi @Fangjian,
I understand you want to train an NER model to recognize custom entities, i.e. entities that are not covered by `en_core_scibert`? In that case, it is indeed better to train your NER model from scratch. Trying to add new categories to an already trained model might result in unpredictable behavior, as it's hard to control how the new data affects the existing weights, especially since the pretrained categories were probably trained on a much bigger dataset. In this post Matt explains the dynamics of resuming the training of a NER component.
So in your case, you either want to use a version of `en_core_scibert` without the NER component as the base model for training, or you want to substitute the `en_core_scibert` NER component with your custom NER component. You can check this example spaCy project to see how it can be done.
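To make the second option a bit more concrete, here's a minimal sketch of swapping out the pretrained NER for a fresh, untrained one (assuming `en_core_scibert` is installed as a loadable spaCy package; the output path is just an example):

```python
import spacy

# Load the pretrained scientific pipeline
nlp = spacy.load("en_core_scibert")

# Remove the pretrained NER and add a fresh, untrained one in its place
nlp.remove_pipe("ner")
nlp.add_pipe("ner", last=True)

# Save the result to use as the base model for training your custom NER
nlp.to_disk("./en_core_scibert_custom_ner_base")  # example path
```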
I'm not sure if there are other components of `en_core_scibert` that depend on NER predictions. If that's the case, you might need to remove them as well, or leave everything as is, give your NER component a different name, and freeze the pretrained `en_core_scibert` components during training. You can read a bit more about customizing spaCy pipelines here.
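If you go with the "different name" route, the sketch below shows the idea (the name `custom_ner` is just an example I'm using here; the freezing itself happens in the training config via `frozen_components`):

```python
import spacy

nlp = spacy.load("en_core_scibert")

# Leave the pretrained "ner" in place and add a second NER under a different name
nlp.add_pipe("ner", name="custom_ner", last=True)
print(nlp.pipe_names)

# During training you'd then list the pretrained components (including "ner")
# under frozen_components in the [training] block of the config, so that only
# "custom_ner" gets updated.
```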
Even though you'd be training your NER component from scratch, you can still benefit from the `en_core_scibert` tokenizer and word vectors and/or pretrained embeddings, both for the training and for the data annotation with `ner.manual`. In other words, you can benefit from transfer learning at the representation level, independent of the pretrained NER component (which you'll discard). In fact, it's very important that the same tokenizer is used during data annotation, model training and later in production. So, in summary, your workflow should look like this:
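As a quick way to see what that shared tokenization looks like (a trivial sketch, assuming `en_core_scibert` is installed; the sentence is just made-up example text):

```python
import spacy

nlp = spacy.load("en_core_scibert")

# make_doc runs only the tokenizer, which is exactly what ner.manual,
# training and your production pipeline should all share
doc = nlp.make_doc("Mutations in BRCA1 are associated with breast cancer.")
print([token.text for token in doc])
```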
- annotate using `ner.manual` with `en_core_scibert` as the base model to benefit from the scientific tokenizer (I'm assuming the labels you'd be using will be different from the `en_core_scibert` pretrained ones); see the example command below
- try saving to disk a version of `en_core_scibert` without the NER component and use this model as the base model for training, OR write a spaCy training config where you substitute the pretrained NER with your custom NER (as in this example project); a sketch of both options follows
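For illustration, here's roughly what those steps could look like. This is only a sketch, not a definitive recipe: the dataset name, labels, file paths and output directory are placeholders I made up, not anything required by Prodigy or spaCy.

```python
# Step 1: annotate with ner.manual, using en_core_scibert for tokenization.
# Run on the command line (dataset name, source file and labels are placeholders):
#   prodigy ner.manual my_sci_dataset en_core_scibert ./papers.jsonl --label GENE,DISEASE

# Step 2 (option A): save a copy of en_core_scibert without its NER component
# and use that directory as the base model for training.
import spacy

nlp = spacy.load("en_core_scibert", exclude=["ner"])
nlp.to_disk("./en_core_scibert_no_ner")  # placeholder output path

# Step 2 (option B): keep the package as-is and, in your spaCy training config,
# source the other components from en_core_scibert while defining a fresh NER,
# as shown in the example project linked above.
```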