For the "spacy_model" argument in the ner.manual recipe, what would be the difference between using "blank:en" or "en_core_web_sm" (or any other trained spacy pipeline)?
Will it affect my training process and its accuracy/efficiency at a later stage? On what basis do I make a decision during the annotation stage? My task is to annotate an "entity" on a bunch of biomedical abstracts, train on the annotated data with different base models (e.g., the 8 SciSpacy models available: scispacy | SpaCy models for biomedical text processing), and figure out which model gives better results.
In a gist: I want to keep my annotated data (file) consistent and train using different models (I want to avoid repeating the annotation task because of time constraints), so I want to figure out the right command-line arguments to use for the ner.manual recipe.
One important aspect to consider is whether to use a pre-trained ner model (e.g., en_core_web_sm) or to train one from scratch (e.g., blank:en). Said differently, do you want to fine-tune the existing entity types (e.g., ORG, PERSON) in an existing pipeline, add new entity types, or some combination of the two?
It sounds like you need completely new entity types and don't need the existing ner types (at least for en_core_web_sm). Therefore, you may be better off starting with a blank model. The one exception may be if you wanted to keep some parts of the en_core_web_sm pipeline -- in that case, you may be better off turning off only the pre-trained ner component while keeping the other components you may need. See this support post for more details:
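To make the annotation step concrete, here's a sketch of a ner.manual call with a blank pipeline (the dataset name, source file, and label below are placeholders for your own):

```shell
# Annotate with a blank English pipeline (tokenizer only).
# "nanoparticle_ner", "abstracts.jsonl", and "NANOPARTICLE" are
# illustrative names -- substitute your own dataset, source, and label(s).
python -m prodigy ner.manual nanoparticle_ner blank:en ./abstracts.jsonl --label NANOPARTICLE
```

One helpful detail for your consistency concern: ner.manual only uses the model for tokenization, so the annotations saved to the dataset are the same whether you pass blank:en or a trained pipeline (as long as the tokenization matches), and you can reuse them to train different base models later.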
Related: how do your entity types compare to the entity types in the SciSpacy models? Are they similar, completely different, or do they partially overlap? If they are completely different, then you may want to train ner models from scratch. If you add new entity types and retrain without regard to the existing entity types, you may run into catastrophic forgetting. There are ways around this, but it may be more effort than you need.
The last part to consider is that different spaCy model pipelines have different vectors (e.g., tok2vec) too. Looking at SciSpacy, it seems like (correct me if I'm wrong) they all use the same word vectors: word2vec word vectors trained on the Pubmed Central Open Access Subset. Therefore, if your entity types are very different from those in the ner models of the SciSpacy pipelines, you may be better off training a ner component from scratch with only those vectors.
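As a sketch of that from-scratch route (file names and the SciSpacy pipeline below are examples, not prescriptions): you can generate a minimal ner-only config and point training at a pipeline's vectors via a config override.

```shell
# Create a minimal English NER-only training config.
python -m spacy init config config.cfg --lang en --pipeline ner

# Train from scratch, initializing only the word vectors from an installed
# SciSpacy pipeline (en_core_sci_lg is an example). train.spacy/dev.spacy
# are placeholder paths for your exported annotations.
python -m spacy train config.cfg --output ./output \
    --paths.train ./train.spacy --paths.dev ./dev.spacy \
    --initialize.vectors en_core_sci_lg
```

The .spacy files can be exported from your Prodigy dataset with the data-to-spacy recipe, so the same annotations feed every experiment.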
Also, if you have trained your own vectors elsewhere (e.g., with gensim), you could use them as well. I also recommend this post because it shows how to pass in initial vectors.
One last detail -- for running multiple training rounds/experiments, consider using spaCy config files and/or a spaCy project. This option may take some time to learn, but it is the best route for reproducibility and for running multiple experiments. I've attached one of our demo projects that uses spaCy projects + Prodigy for a new (from-scratch) ner model.
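Getting started with a project looks roughly like this (the project name here is illustrative -- use the demo project linked above, or browse the repo for others):

```shell
# Clone a spaCy project template and run its workflow.
# "tutorials/ner_drugs" is an example project name from the
# explosion/projects repo -- substitute the one you actually want.
python -m spacy project clone tutorials/ner_drugs
cd ner_drugs
python -m spacy project assets   # download the data assets the project defines
python -m spacy project run all  # run the full workflow from project.yml
```

Everything the project does (preprocessing, training, evaluation) is declared in its project.yml, which is what makes the experiments easy to rerun and compare.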
You may also find additional ner examples of spaCy projects in that same GitHub repo.
Let me know if this makes sense and if you have any further questions!
Thank you for the very detailed explanation @ryanwesslen! Your response was extremely helpful and has given me a much better sense of direction to work with.
I had a quick follow-up question regarding what you have mentioned here.
By "other components" you mean pipeline components like the tagger, parser, lemmatizer, etc., right? Do these affect the training process for a NER model? That is, will my model's ability to predict a named entity depend on how well the model tags POS terms (or something along those lines with the other components in the pipeline)? Or is it independent, so I don't really need to consider the exception?
A while ago I read the documentation on https://spacy.io/models, under the section "Overview > Pipeline design", which states:
"When modifying a trained pipeline, it's important to understand how the components depend on each other. Unlike spaCy v2, where the tagger, parser and ner components were all independent, some v3 components depend on earlier components in the pipeline. As a result, disabling or reordering components can affect the annotation quality or lead to warnings and errors."
On what basis do I decide if I wanted to keep some components of the pipeline and disable others? This is what my project is focusing on right now -- "automating data extraction of nanoparticle entities from a bunch of related biomedical titles and abstracts". So NER is really the only thing I have been looking into till now.
What is shared is the embeddings/tok2vec layer (via the listener). You can read more about this here:
So you're right that you'd need to be careful when disabling components in a pipeline and then retraining. But that only applies when you want to modify existing model components, e.g., fine-tune the pretrained ner model. If you're starting with a blank ner, which it seems you may want to do since you have new entity types, you can ignore the other components and train from scratch. The key thing you'd vary in each experiment is the word vectors / tok2vec layer, to see whether it provides additional model lift.
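A rough sketch of that experiment loop, assuming the dataset name is a placeholder and you've installed the SciSpacy pipelines you want to compare (note the transformer-based en_core_sci_scibert has no static vectors, so it wouldn't fit this particular pattern):

```shell
# Export the Prodigy annotations once; data-to-spacy writes a config.cfg
# plus train.spacy/dev.spacy into ./corpus. "nanoparticle_ner" is a
# placeholder dataset name.
python -m prodigy data-to-spacy ./corpus --ner nanoparticle_ner

# Train the same ner config with different vector sources and compare.
for vectors in en_core_sci_md en_core_sci_lg; do
    python -m spacy train ./corpus/config.cfg --output "./output_${vectors}" \
        --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy \
        --initialize.vectors "${vectors}"
done
```

Because the annotations are exported once and only the vectors override changes, any difference in the evaluation scores across ./output_* runs is attributable to the vectors.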
Let me know if this helps or if you have any other questions!