Blank spaCy model without being trained

Hi,

I would like to run ner.manual with a blank spaCy model, and I was wondering how to make a blank spaCy model (completely empty, not trained at all). Or does the model need to be trained?

I have tried things like:

import spacy

model_nlp = spacy.blank('en')
model_nlp.add_pipe(model_nlp.create_pipe('ner'))  # add a blank entity recognizer
model_nlp.begin_training()
model_nlp.to_disk('model-ner-en-blank')

But when I use this model in the command, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'model-ner-en-blank/ner/tok2vec_model'

This is the command that I want to use:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

Thanks,
Anne

Hi! If you only want to run ner.manual, you don’t even need the ner component in the pipeline. The model is only used for tokenization. So you can use a completely blank model, or even a pre-trained model, as long as it has the same language and tokenization rules.
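For example, a minimal sketch along these lines (the directory name here is just a placeholder) would give you a tokenizer-only model that ner.manual can load:

import spacy

# A blank English pipeline already comes with the default tokenizer,
# which is all ner.manual needs: no ner component, no training step.
nlp = spacy.blank('en')
nlp.to_disk('model-en-blank')  # then pass this directory on the command line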

Your code does seem okay, though, for creating a blank model with a blank entity recognizer. Did you double-check that the directory model-ner-en-blank you're referencing on the command line definitely exists? And if it does, can you check what's in it?

Hi, thank you for your quick response.
Yes, the model-ner-en-blank directory exists. There are two files in this folder: meta.json and tokenizer. There are also two folders: ner and vocab. The ner folder contains the following files: cfg, model and moves. The vocab folder contains: key2row, lexemes.bin, strings.json and vectors.

When I run this:

prodigy ner.manual NerTagDB AllSentences.jsonl --label "NSAID, COAG"

I get the following error:

OSError: [E053] Could not read meta.json from AllSentences.jsonl / meta.json

Or is this not what you meant by "you don't need the ner component"? Or do you mean that I can use, for example, en_core_web_sm?

When I run the following:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

I get the error again:

FileNotFoundError: [Errno 2] No such file or directory: 'model-ner-en-blank/ner/tok2vec_model'

What I want to do with ner.manual is to tag two entities (NSAID and COAG) manually, and then probably use ner.teach and/or ner.batch-train with the resulting data.

Hi,

We have succeeded in making a blank spaCy model. The strange thing is that it works with an older spaCy version, 2.0.18, and not with the version I had, 2.1.3.
With the older version, the files that are needed are created. And now it works with this command:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

Ahh, this explains a lot. Glad you got it working! And yes, Prodigy currently uses spaCy 2.0. Models between spaCy 2.0 and 2.1 aren't compatible, which is likely why you were seeing this error.

It's also the reason we're still working on testing spaCy v2.1 with Prodigy before we release the new Prodigy version that depends on spaCy v2.1 (see this thread for details). Once that's out, everyone will need to retrain their models, so we need to make sure everything works as expected.

Also, just to clarify:

Yes, I meant that you could also use the en_core_web_sm model. Its tokenization rules will be the same as the tokenization rules of the blank model, and the tokenizer is all ner.manual needs. (It pre-tokenizes the text to make it easier to highlight words because the selection can snap to the token boundaries. It also helps you spot tokenization issues and prevents you from blindly creating annotations that will never be "true" in real life because the tokenization doesn't match the entities you highlight.)
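To make this more concrete, here's a rough sketch (the example sentence and character offsets are made up) showing the tokens ner.manual will snap to, and how a misaligned span shows up as None in spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Patients on ibuprofen and warfarin were excluded.')

# These are the token boundaries the manual interface snaps to.
print([token.text for token in doc])

# char_span returns None if the character offsets don't line up with
# token boundaries, i.e. an annotation the model could never predict.
print(doc.char_span(12, 21, label='NSAID'))  # 'ibuprofen'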

To be 100% sure, are the tokenization rules the same for all of the blank, "sm", "md" and "lg" models? (English · spaCy Models Documentation, I see a TOKEN_ACC of 1.0 for all three.)

And a slightly more nuanced question to the original poster (don't want to bombard you with another thread): if I perform ner.manual using the "md" model, do I have to train using the "md" model as well? You indicate around minute 10 in the YouTube video (https://www.youtube.com/watch?v=59BKHO_xBPA) that you have to use the same "blank:en" base model in order for things to work. Sorry if this is basic, I'm still very new to this world.

Thank you!!

Yes, that's correct. The trained pipelines we provide all use the default tokenization rules included with spaCy that are also available when you create a blank nlp object (e.g. spacy.blank("en")).

If you're using ner.manual, the only thing that matters is that the tokenizer is the same: so you can use blank:en and then train a model based on en_core_web_md. What you couldn't/shouldn't do is train a model using a different tokenizer with different rules, because then you may end up with annotations that don't match the model's tokenization and that it can't learn from.
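If you want to double-check this for your own setup, a quick sanity check could look something like this (just a sketch with an arbitrary sentence; it assumes en_core_web_md is installed):

import spacy

text = "Don't co-administer NSAIDs with anticoagulants."

blank_nlp = spacy.blank('en')
md_nlp = spacy.load('en_core_web_md')

# Both use spaCy's default English tokenization rules, so the token
# sequences should come out identical.
assert [t.text for t in blank_nlp(text)] == [t.text for t in md_nlp(text)]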
