Blank spacy model vs en_core_web_xx

mikkelyo · October 15, 2021, 2:02pm

Hello

I've been using prodigy + spacy for NER and custom NER for a few days, and have been quite successful in making annotations and models by fine-tuning the en_core_web_lg model.

I want to try to do the same thing, but with a blank spacy model:
When using ner.teach and/or ner.manual, I specify en_core_web_lg to assist me in annotation (which is awesome!). However, I don't fully understand this process. After finishing annotation, I convert my annotations to spaCy's format with the "data-to-spacy" command, but if I specify "blank:en" as the base model there (instead of en_core_web_lg), I get an error. If I run data-to-spacy without specifying a model, it works fine, and so does the training. But I don't understand how to switch to a blank:en base model instead. I've tried putting blank:en in various places in the .cfg file that data-to-spacy generated (vectors?), but then I get an error during training saying the model has no vectors, which I suppose makes sense.
Where do I specify that I want to train a new blank:en model? I trust that I won't have to re-annotate all my data, although I used the en_core_web_lg for assistance in this process.

Bonus question: I am purely trying to make cool models for custom NER. From googling a bit, I get the impression that I would be more successful training a model from scratch for this purpose instead of fine-tuning en_core_web_lg. Is that right?

I'm sorry if the questions are confusing - I'm quite confused. I'm a physicist - not a data scientist.

Thanks in advance

SofieVL · October 21, 2021, 4:12pm

Hi!

The annotation part and the export/training part can indeed be seen as largely independent steps. When you use en_core_web_lg to create better/faster NER annotations, you can use those annotations in a subsequent step to either train a new model from scratch, or build upon an existing model.

The base_model option in data-to-spacy allows you to take an existing, trained component and finetune it further with your annotations. In spaCy terms, this means that the existing component is being sourced. If however you want to train a new model from scratch, simply leave out the base_model parameter and use the lang parameter to start from a blank model with the correct language. There's also no need to put "blank:en" in the config anywhere. If the config doesn't contain any sourced components, the training loop will create & train new ones from scratch.
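To make the sourced vs. from-scratch distinction concrete, here is a minimal sketch that inspects a config and reports where each component comes from. The two config excerpts are hypothetical and heavily trimmed (a real config generated by data-to-spacy contains many more sections), and `component_origins` is a helper name made up for this illustration, not part of spaCy or Prodigy. It only assumes that a sourced component block carries a `source` key, while a component trained from scratch carries a `factory` key.

```python
import configparser

# Hypothetical, trimmed-down config excerpts for illustration only.
FROM_SCRATCH = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components.ner]
factory = "ner"
"""

FINE_TUNED = """
[nlp]
lang = "en"
pipeline = ["ner"]

[components.ner]
source = "en_core_web_lg"
"""


def component_origins(config_text):
    """Map each [components.<name>] block to 'new' or 'sourced from <model>'."""
    # Interpolation is disabled so values like ${paths.train} are read verbatim.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    origins = {}
    for section in parser.sections():
        parts = section.split(".")
        # Only look at top-level component blocks, e.g. [components.ner].
        if len(parts) == 2 and parts[0] == "components":
            block = parser[section]
            if "source" in block:
                origins[parts[1]] = "sourced from " + block["source"].strip('"')
            elif "factory" in block:
                origins[parts[1]] = "new"
    return origins


print(component_origins(FROM_SCRATCH))
print(component_origins(FINE_TUNED))
```

If every component is reported as "new", nothing is sourced, which matches the from-scratch case described above.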

As for your bonus question: it really depends on your use-case and domain. The pretrained pipelines in spaCy, like en_core_web_lg, have been trained on pretty standard, well-punctuated texts. If you're working with texts that look very different (for instance: tweets), then it might actually make more sense to start training from scratch. However, in that case, you'll need to have a sufficient amount of training data as well!

If, on the other hand, your use-case/input texts are quite close to the original training data for the pretrained model, or if you don't have a lot of annotated data, you could consider fine-tuning/sourcing the NER component instead.

mikkelyo · October 25, 2021, 8:07am

What a completely perfect answer, thanks a lot
