Hi @nanyasrivastav!
One important aspect to consider is whether to use a pre-trained ner model (e.g., en_core_web_sm) or train one from scratch (e.g., blank:en). Said differently, do you want to fine-tune the existing entity types (e.g., ORG, PERSON) in an existing pipeline, add new entity types, or some combination?
It sounds like you need completely new entity types and don't need the existing ner types (at least for en_core_web_sm). Therefore, you may be better off starting with a blank model. The one exception may be if you want to keep some parts of the en_core_web_sm pipeline. In that case, you may be better off turning off only the pre-trained ner component while keeping the other components you need. See this support post for more details:
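As a rough sketch of the two options described above (not the only way to set this up), the snippet below builds a blank English pipeline with a fresh ner component, and separately shows loading a pre-trained pipeline with its ner left out. The label names and the helper function names are hypothetical, and the second helper assumes en_core_web_sm is installed.

```python
import spacy

# Hypothetical labels, used only for illustration.
NEW_LABELS = ["GENE", "DISEASE"]


def build_blank_ner(labels):
    """Start from a blank English pipeline with a fresh, untrained ner."""
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    for label in labels:
        ner.add_label(label)
    return nlp


def load_without_pretrained_ner(name="en_core_web_sm"):
    """Keep the other components but leave out the pre-trained ner.

    Assumes the named pipeline package is installed.
    """
    return spacy.load(name, exclude=["ner"])


nlp = build_blank_ner(NEW_LABELS)
print(nlp.pipe_names)
```

With the second option, you would still add a new ner component to the loaded pipeline before training it on your own labels.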
Relatedly, how do your entity types compare to the entity types in the SciSpacy models? Are they similar, completely different, or partially overlapping? If they are completely different, then you may want to train the ner models from scratch here as well. If you add new entity types and retrain without regard to the existing entity types, you may run into catastrophic forgetting. There are ways around this, but they may be more effort than you need.
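One quick way to answer the overlap question is to compare the two label sets directly. The helper below is a minimal sketch with made-up label lists; for a loaded pipeline, the existing labels could come from nlp.get_pipe("ner").labels.

```python
def compare_labels(my_labels, existing_labels):
    """Split two label collections into shared and exclusive groups."""
    mine, existing = set(my_labels), set(existing_labels)
    return {
        "shared": sorted(mine & existing),
        "only_mine": sorted(mine - existing),
        "only_existing": sorted(existing - mine),
    }


# Hypothetical label sets, for illustration only.
my_labels = ["GENE", "CHEMICAL", "DOSAGE"]
existing_labels = ["CHEMICAL", "DISEASE"]

report = compare_labels(my_labels, existing_labels)
print(report)
```

If "shared" comes back empty, starting from scratch is likely the simpler route. If most labels are shared, fine-tuning the existing component may be worth the extra care around forgetting.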
The last part to consider is that different spaCy model pipelines have different vectors (e.g., tok2vec) too. Looking at SciSpacy, it seems like (correct me if I'm wrong) they all use the same word vectors: word2vec word vectors trained on the PubMed Central Open Access Subset. Therefore, if your entity types are very different from those in the ner models for the SciSpacy pipelines, you may be better off training a ner component from scratch with only those vectors.
As alternative benchmarks, you could compare performance when using different-sized vectors like en_core_web_sm, en_core_web_lg, and en_core_web_trf (if you are interested in transformers). Just know that with the transformer pipeline, you may gain performance but at the expense of speed and higher production overhead (e.g., GPU).
Also, if you have trained your own vectors elsewhere (e.g., with gensim), you could use them as well. I recommend this post too because it shows how to pass in initial-vectors.
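To illustrate the idea of bringing your own vectors, here is a minimal sketch that adds a few vectors to a blank pipeline's vocab one at a time. The custom_vectors dict is a made-up stand-in for vectors exported from a tool like gensim; the words and values are placeholders, not real embeddings.

```python
import numpy
import spacy

# Stand-in for vectors trained elsewhere (hypothetical words and values).
custom_vectors = {
    "gene": numpy.asarray([0.1, 0.2, 0.3, 0.4], dtype="float32"),
    "protein": numpy.asarray([0.2, 0.1, 0.4, 0.3], dtype="float32"),
    "dosage": numpy.asarray([0.9, 0.8, 0.1, 0.0], dtype="float32"),
}

nlp = spacy.blank("en")
for word, vector in custom_vectors.items():
    # Register each word and its vector in the pipeline's vocab.
    nlp.vocab.set_vector(word, vector)

print(nlp.vocab.has_vector("gene"))
print(nlp.vocab.vectors.shape)
```

For a full training run, vectors are more commonly converted once with the spacy init vectors command and then referenced from the vectors setting in the [initialize] block of the training config.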
One last detail: for multiple training rounds/experiments, consider using spaCy config files and/or a spaCy project. This option may take some time to learn, but it is the best for reproducibility and running multiple experiments. I've attached one of our demo projects using spaCy projects + Prodigy for a new (from scratch) ner model.
You may also find additional ner examples of spaCy projects in that same GitHub repo.
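If you want to script several runs before moving to a full spaCy project, one possible approach is to generate a spacy train command per experiment and override the vectors path on the command line. This is only a sketch: the experiment names, file paths, and the build_train_command helper are hypothetical, and it assumes your config.cfg defines vectors under [paths] (as the generated quickstart configs do).

```python
# Hypothetical experiments: name -> vectors source (package name or path).
VECTOR_SOURCES = {
    "no_vectors": None,
    "web_lg": "en_core_web_lg",
    "custom": "./my_vectors",
}


def build_train_command(name, vectors, config="config.cfg", out_root="training"):
    """Build the `spacy train` command for one experiment (not executed here)."""
    cmd = [
        "python", "-m", "spacy", "train", config,
        "--output", f"{out_root}/{name}",
        "--paths.train", "corpus/train.spacy",
        "--paths.dev", "corpus/dev.spacy",
    ]
    if vectors is not None:
        # Overrides the vectors entry in the [paths] block of the config.
        cmd += ["--paths.vectors", vectors]
    return cmd


commands = {
    name: build_train_command(name, vectors)
    for name, vectors in VECTOR_SOURCES.items()
}
for name, cmd in commands.items():
    print(name, "->", " ".join(cmd))
```

Each command could then be run with subprocess.run(cmd, check=True). Note that the config also has to be set up to actually use static vectors (for example, via include_static_vectors in the embedding layer), so a run with and without vectors may need slightly different configs.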
Let me know if this makes sense and if you have any further questions!