Blank spacy model vs en_core_web_xx


I've been using Prodigy + spaCy for NER and custom NER for a few days, and have been quite successful in creating annotations and training models by fine-tuning the en_core_web_lg model.

I want to try to do the same thing, but with a blank spacy model:
When using ner.teach and/or ner.manual, I specify en_core_web_lg to assist me in annotation (which is awesome!). However, I don't fully understand the next step. After finishing annotation, I convert my annotations to spaCy format with the "data-to-spacy" command, but if I specify "blank:en" as the base model there (instead of en_core_web_lg), I get an error. If I run data-to-spacy without specifying a model, it works fine, and so does the training. But I don't understand how to switch to a blank:en base model. I've tried putting blank:en in various places in the .cfg file that data-to-spacy generated (vectors?), but then I get an error during training saying the model has no vectors, which I suppose makes sense.
Where do I specify that I want to train a new blank:en model? I trust that I won't have to re-annotate all my data, although I used the en_core_web_lg for assistance in this process.

Bonus question: I am purely trying to make cool models for custom NER. From googling a bit, I get the impression that I will be more successful training a model from scratch for this purpose than fine-tuning en_core_web_lg. Is that correct?

I'm sorry if the questions are confusing - I'm quite confused. I'm a physicist - not a data scientist.

Thanks in advance :slight_smile:


The annotation part and the export/training part can indeed be seen as largely independent steps. When you use en_core_web_lg to create better/faster NER annotations, you can use those annotations in a subsequent step to either train a new model from scratch, or build upon an existing model.

The base_model option in data-to-spacy allows you to take an existing, trained component and finetune it further with your annotations. In spaCy terms, this means that the existing component is being sourced. If however you want to train a new model from scratch, simply leave out the base_model parameter and use the lang parameter to start from a blank model with the correct language. There's also no need to put "blank:en" in the config anywhere. If the config doesn't contain any sourced components, the training loop will create & train new ones from scratch.
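To check which of the two situations a generated config is in, you can look for `source` entries in its `[components.*]` blocks. The snippet below is only a rough sketch using Python's standard library rather than spaCy's own config loader; the helper name `sourced_components` and the two sample config strings are made up for illustration, and a real config from data-to-spacy will contain many more sections.

```python
import configparser


def sourced_components(config_text: str) -> dict:
    """Return {component_name: source_pipeline} for every sourced component.

    Hypothetical helper: it only reads the text of a config and does not
    validate it the way spaCy itself would.
    """
    # Interpolation is disabled so values such as ${paths.train} are kept as-is.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    found = {}
    for section in parser.sections():
        # Component blocks are named [components.<name>]; nested blocks such as
        # [components.ner.model] have more dots and are skipped.
        parts = section.split(".")
        if len(parts) == 2 and parts[0] == "components":
            if parser.has_option(section, "source"):
                found[parts[1]] = parser.get(section, "source").strip('"')
    return found


# Illustrative config where the NER component is created from scratch.
FROM_SCRATCH = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components.ner]
factory = "ner"
"""

# Illustrative config where the NER component is sourced from a trained pipeline.
FINE_TUNED = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components.ner]
source = "en_core_web_lg"
"""

print(sourced_components(FROM_SCRATCH))
print(sourced_components(FINE_TUNED))
```

If the result is empty, nothing is being sourced and training will start from freshly initialized components, which is the "blank" setup you are after.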

As for the bonus question, it really depends on your use-case and domain. The pretrained pipelines in spaCy, like en_core_web_lg, have been trained on fairly standard, well-punctuated texts. If you're working with texts that look very different (for instance, tweets), then it might actually make more sense to start training from scratch. However, in that case, you'll need to have a sufficient amount of training data as well!

If, on the other hand, your use-case/input texts are quite close to the original training data for the pretrained model, or if you don't have a lot of annotated data, you could consider fine-tuning/sourcing the NER component instead.

What a completely perfect answer, thanks a lot :slight_smile: