Hi! If you only want to run ner.manual, you don’t even need the ner component in the pipeline. The model is only used for tokenization, so you can use a completely blank model, or even a pre-trained model, as long as it has the same language and tokenization rules.
Your code does seem okay, though, for creating a blank model with a blank entity recognizer. Did you double-check that the directory model-ner-en-blank you’re referencing on the command line definitely exists? And if it does, can you check what’s in it?
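If it helps, here is a minimal sketch of one way to do that check from Python, using only the standard library. The helper name describe_model_dir is made up for this illustration and is not part of spaCy or Prodigy; the directory name is the one from your command.

```python
from pathlib import Path


def describe_model_dir(path):
    """Return a sorted list of entries in a model directory, or None if it is missing.

    Hypothetical helper for illustration only. Sub-directories are marked
    with a trailing slash so they are easy to tell apart from files.
    """
    root = Path(path)
    if not root.is_dir():
        return None
    return sorted(p.name + ("/" if p.is_dir() else "") for p in root.iterdir())


# Prints None if the directory does not exist relative to the current
# working directory, otherwise the list of files and folders inside it.
print(describe_model_dir("model-ner-en-blank"))
```

Running this from the same working directory as the Prodigy command should show whether the relative path resolves to the folder you expect.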
Hi, thank you for your quick response.
Yes, the model-ner-en-blank directory exists. There are two files in this folder: meta.json and tokenizer. There are also two folders: ner and vocab. The ner folder contains the following files: cfg, model and moves. The vocab folder contains: key2row, lexemes.bin, strings.json and vectors.
We have succeeded in making a blank spaCy model. The strange thing is that it works with an older version of spaCy (2.0.18) and not with the version I had (2.1.3).
With the older version, the files that are needed are created. And now it works with this command:
Ahh, this explains a lot. Glad you got it working! And yes, Prodigy currently uses spaCy 2.0. Models between spaCy 2.0 and 2.1 aren't compatible, which is likely why you were seeing this error.
It's also the reason we're still working on testing spaCy v2.1 with Prodigy before we release the new Prodigy version that depends on spaCy v2.1 (see this thread for details). Once that's out, everyone will need to retrain their models, so we need to make sure everything works as expected.
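For anyone who runs into the same mismatch, a rough way to see which spaCy version a saved model directory was created for is to look at its meta.json. The sketch below assumes the file contains a "spacy_version" entry, which spaCy normally writes when saving a model; if your meta.json has no such key, the function simply returns None. The function name is hypothetical and only the standard library is used.

```python
import json
from pathlib import Path


def read_model_spacy_version(model_dir):
    """Return the "spacy_version" string from <model_dir>/meta.json.

    Returns None if meta.json is missing or has no such key.
    Assumption: the model was saved by spaCy, which usually records
    the version requirement under this key.
    """
    meta_path = Path(model_dir) / "meta.json"
    if not meta_path.is_file():
        return None
    with meta_path.open(encoding="utf8") as f:
        meta = json.load(f)
    return meta.get("spacy_version")


# Directory name taken from the thread; adjust to your own path.
print(read_model_spacy_version("model-ner-en-blank"))
```

You can then compare the result by eye with the installed version, which is available as spacy.__version__ in the environment Prodigy runs in.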
Also, just to clarify:
Yes, I meant that you could also use the en_core_web_sm model. Its tokenization rules will be the same as the tokenization rules of the blank model, and the tokenizer is all ner.manual needs. (It pre-tokenizes the text to make it easier to highlight words because the selection can snap to the token boundaries. It also helps you spot tokenization issues and prevents you from blindly creating annotations that will never be "true" in real life because the tokenization doesn't match the entities you highlight.)
To be 100% sure, are the tokenization rules the same for all of the blank, "sm", "md" and "lg" models? (At https://spacy.io/models/en, I see Token_ACC v 1.0 for all three.)
And a slightly more nuanced follow-up question (I don't want to bombard you with another thread): if I perform ner.manual using the MD model, do I have to train using the MD model as well? You indicate around minute 10 of the YouTube video (Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning - YouTube) that you have to use the same pre-trained "blank:en" in order for things to work. Sorry if this is basic, I'm still very new to this world.
Yes, that's correct. The trained pipelines we provide all use the default tokenization rules included with spaCy that are also available when you create a blank nlp object (e.g. spacy.blank("en")).
If you're using ner.manual, the only thing that matters is that the tokenizer is the same, so you can annotate with blank:en and then train a model based on en_core_web_md. What you couldn't/shouldn't do is train a model using a different tokenizer with different rules, because then you may end up with annotations that don't match the model's tokenization and that it can't learn from.
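To make the mismatch problem concrete, here is a small sketch in plain Python. The two tokenizers are deliberately simplistic stand-ins written for this example, not spaCy's actual rules, and all function names are made up. The point is only that the same character offsets can line up with token boundaries under one set of rules and not under another.

```python
import re


def whitespace_tokens(text):
    """Toy tokenizer: split on whitespace only. Returns (start, end) offsets."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def punct_tokens(text):
    """Toy tokenizer: words and single punctuation marks are separate tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def span_is_aligned(tokens, start, end):
    """True if the span starts at a token start and ends at a token end."""
    starts = {s for s, _ in tokens}
    ends = {e for _, e in tokens}
    return start in starts and end in ends


text = "She moved to New York, then left."
start = text.index("New York")
end = start + len("New York")

# Annotated with a tokenizer that splits off the comma: the span fits.
print(span_is_aligned(punct_tokens(text), start, end))
# Trained with a tokenizer that keeps "York," as one token: it does not.
print(span_is_aligned(whitespace_tokens(text), start, end))
```

In spaCy itself, doc.char_span(start, end) returns None when the character offsets don't map onto token boundaries, which is a quick way to find annotations that a given tokenizer can't represent.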