Which base_model to use for ner.manual

Hi,

I have a simple workflow question hoping you could help me to clarify.

I have decided to use scispaCy's en_core_sci_scibert to train my NER model with a custom entity type on dense scientific papers. My understanding is that after I use ner.manual to create my annotated training dataset, I will use the train recipe with en_core_sci_scibert as the --base-model argument. My question is: in the first step, when I use ner.manual to create my training data, should I:

  1. use blank:en as the spacy_model argument, OR
  2. use en_core_sci_scibert as the spacy_model argument?
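To make the two options concrete, this is roughly what the two commands would look like (the dataset name, source file and label are just placeholders I made up):

```
# Option 1: blank English pipeline (tokenizer only)
prodigy ner.manual sci_ner blank:en ./papers.jsonl --label MY_ENTITY

# Option 2: scispaCy pipeline (scientific tokenizer, vectors, tagger, etc.)
prodigy ner.manual sci_ner en_core_sci_scibert ./papers.jsonl --label MY_ENTITY
```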

I have this question because I read another post:

Likely a base model will help by providing word vectors, but you'll want to turn off the ner component as your ner model will be trained from scratch.

I used to think that using blank:en for ner.manual was always better when creating a custom entity. However, after reading the above post, I am a bit hesitant. Given that I need the tokenizer and possibly other components such as the POS tagger and lemmatizer from the en_core_sci_scibert model, could you please clarify which approach I should take in my case?

Thank you so much for your clarification!

Hi @Fangjian,

I understand you want to train an NER model to recognize custom entities, i.e. entities that are not covered by en_core_sci_scibert? In that case, it is indeed better to train your NER model from scratch. Trying to add new categories to an already trained model can result in unpredictable behavior, as it's hard to control how the new data affects the existing weights, especially since the pretrained categories were probably trained on a much bigger dataset. In this post Matt explains the dynamics of resuming the training of an NER component.

So in your case, you either want to use a version of en_core_sci_scibert without the NER component as the base model for training, or you want to substitute the en_core_sci_scibert NER component with your custom NER component. You can check this example spaCy project to see how it can be done.
I'm not sure whether other components of en_core_sci_scibert depend on the NER predictions. If that's the case, you might need to remove them as well, or leave everything as is, give your NER component a different name and freeze the rest of en_core_sci_scibert during training. You can read a bit more about customizing spaCy pipelines here.
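For the first option, a minimal sketch of stripping the NER component and saving the result could look like this (the output path is just a placeholder, and this assumes a spaCy v3 pipeline):

```python
import spacy

# Load the scispaCy pipeline but leave out its pretrained NER component
nlp = spacy.load("en_core_sci_scibert", exclude=["ner"])

# Save the NER-free copy; this directory can then be used as the base model
nlp.to_disk("./en_core_sci_scibert_no_ner")
```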

Even though you'd be training your NER component from scratch, you can still benefit from the en_core_sci_scibert tokenizer and word vectors and/or pretrained embeddings, both for the training and for the data annotation with ner.manual. In other words, you can benefit from transfer learning at the representation level independently of the pretrained NER component (which you'll discard). In fact, it's very important that the same tokenizer is used during data annotation, model training and later in production. So, in summary, your workflow should look like this:

  1. annotate using ner.manual with en_core_sci_scibert as the base model to benefit from the scientific tokenizer (I'm assuming the labels you'll be using will be different from the pretrained en_core_sci_scibert ones)
  2. try saving to disk a version of en_core_sci_scibert without the NER component and use this model as the base model for training, OR write a spaCy training config where you substitute the pretrained NER with your custom NER component (as in this example project); a rough sketch of the first approach is shown below
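To illustrate, the two steps might look roughly like this (dataset name, file paths and label are hypothetical; the config-based substitution alternative is shown in the linked example project instead):

```
# Step 1: annotate with the scientific tokenizer from en_core_sci_scibert
prodigy ner.manual sci_ner en_core_sci_scibert ./papers.jsonl --label MY_ENTITY

# Step 2: train, pointing --base-model at the NER-free copy saved earlier
prodigy train ./output --ner sci_ner --base-model ./en_core_sci_scibert_no_ner
```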

Your answer is crystal clear and very informative. Thank you so much for your help!
