Blank spacy model without being trained

AvE · April 23, 2019, 2:43pm

Hi,

I would like to run ner.manual with a blank spacy model. And I was wondering how to make a blank spacy model (completely empty, without being trained). Or should the model be trained?

I have tried things like:

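import spacy

# spaCy v2 API: create a blank English pipeline and add an untrained NER component
model_nlp = spacy.blank('en')
model_nlp.add_pipe(model_nlp.create_pipe('ner'))
model_nlp.begin_training()  # initializes the component's weights
model_nlp.to_disk('model-ner-en-blank')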

But when I use this in the command I get this error:

FileNotFoundError: [Errno 2] No such file or directory: 'model-ner-en-blank/ner/tok2vec_model'

This is the command that I want to use:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

Thanks,
Anne

ines · April 23, 2019, 3:45pm

Hi! If you only want to run ner.manual, you don’t even need the ner component in the pipeline. The model is only used for tokenization. So you can use a completely blank model, or even a pre-trained model, as long as it has the same language and tokenization rules.

Your code does seem okay, though, for creating a blank model with a blank entity recognizer. Did you double-check that the directory model-ner-en-blank you’re referencing on the command line definitely exists? And if it does, can you check what’s in it?
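One way to run that check is a small script that lists everything under the model directory. This is only a sketch using the Python standard library; the helper name list_model_dir is made up for illustration and is not part of spaCy or Prodigy.

```python
from pathlib import Path


def list_model_dir(model_path):
    """Return the relative paths of everything under model_path, sorted.

    Hypothetical helper for illustration; not a spaCy or Prodigy function.
    """
    root = Path(model_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Model directory not found: {root}")
    # rglob("*") walks files and sub-directories recursively
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


if __name__ == "__main__":
    try:
        for entry in list_model_dir("model-ner-en-blank"):
            print(entry)
    except FileNotFoundError as err:
        print(err)
```

Comparing the printed list against the path named in the error message shows whether the file is actually missing or whether the command is pointing at a different directory.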

AvE · April 24, 2019, 7:36am

Hi, thank you for your quick response.
Yes, the model-ner-en-blank directory exists. There are two files in this folder: meta.json and tokenizer. There are also two folders: ner and vocab. The ner folder contains the following files: cfg, model and moves. The vocab folder contains: key2row, lexemes.bin, strings.json and vectors.

When I run this:

prodigy ner.manual NerTagDB AllSentences.jsonl --label "NSAID, COAG"

I get the following error:

OSError: [E053] Could not read meta.json from AllSentences.jsonl / meta.json

Or is this not what you meant by “you don’t need the ner component”? Or do you mean that I can use, for example, en_core_web_sm?

When I run the following:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

I get the error again:

FileNotFoundError: [Errno 2] No such file or directory: 'model-ner-en-blank/ner/tok2vec_model'

What I want to do with ner.manual is to tag two entities (NSAID and COAG) manually. And then probably to use ner.teach and / or ner.batch-train, with the data that came out.

AvE · April 24, 2019, 7:51am

Hi,

We have succeeded in making a blank spaCy model. The strange thing is that it works with an older version of spaCy, 2.0.18, but not with the version I had, 2.1.3.
With the older version, the files that are needed are created. And now it works with this command:

prodigy ner.manual NerTagDB model-ner-en-blank AllSentences.jsonl --label "NSAID, COAG"

ines · April 24, 2019, 10:06am

Ahh, this explains a lot. Glad you got it working! And yes, Prodigy currently uses spaCy 2.0. Models between spaCy 2.0 and 2.1 aren't compatible, which is likely why you were seeing this error.
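A quick sanity check for this kind of mismatch is to compare the minor series of the spaCy version a model was saved with against the one that loads it. The sketch below assumes, based only on the 2.0 vs. 2.1 case described here, that models are compatible within the same major.minor series; the function names are hypothetical, and in a real script the running version would come from spacy.__version__.

```python
def minor_series(version):
    """Return (major, minor) from a version string such as '2.1.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def same_minor_series(saved_with, running):
    """True if both versions belong to the same major.minor series.

    Assumption for illustration: compatibility follows the minor series,
    as in the spaCy 2.0 vs. 2.1 case from this thread.
    """
    return minor_series(saved_with) == minor_series(running)


# Model saved with 2.1.3, loaded by a Prodigy install that uses 2.0.18
print(same_minor_series("2.1.3", "2.0.18"))
# Model saved with 2.0.18, loaded with 2.0.18
print(same_minor_series("2.0.18", "2.0.18"))
```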

It's also the reason we're still working on testing spaCy v2.1 with Prodigy before we release the new Prodigy version that depends on spaCy v2.1 (see this thread for details). Once that's out, everyone will need to retrain their models, so we need to make sure everything works as expected.

Also, just to clarify:

Yes, I meant that you could also use the en_core_web_sm model. Its tokenization rules will be the same as the tokenization rules of the blank model, and the tokenizer is all ner.manual needs. (It pre-tokenizes the text to make it easier to highlight words because the selection can snap to the token boundaries. It also helps you spot tokenization issues and prevents you from blindly creating annotations that will never be "true" in real life because the tokenization doesn't match the entities you highlight.)
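To illustrate what "snapping to token boundaries" means, here is a rough sketch in plain Python. It is not Prodigy's implementation: it uses a simple whitespace tokenizer as a stand-in for spaCy's tokenizer, and the function names are invented for this example.

```python
import re


def whitespace_token_offsets(text):
    """Character (start, end) offsets of whitespace-separated tokens.

    Stand-in tokenizer for illustration; spaCy's rules are more detailed.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def snap_to_tokens(text, start, end):
    """Expand a character selection so it covers every token it overlaps."""
    touched = [
        (s, e) for s, e in whitespace_token_offsets(text) if s < end and e > start
    ]
    if not touched:
        return None  # the selection only covered whitespace
    return touched[0][0], touched[-1][1]


text = "Patient takes ibuprofen daily"
# A sloppy selection in the middle of a word...
snapped = snap_to_tokens(text, 16, 20)
# ...is widened to the full token
print(snapped, text[snapped[0]:snapped[1]])
```

Because every highlighted span ends up on token boundaries, the resulting annotations are ones a model with the same tokenizer could actually predict.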

aball123 · July 29, 2021, 3:45am

To be 100% sure, are the tokenization rules the same for all of the blank, "sm", "md" and "lg" models? (English · spaCy Models Documentation, I see Token_ACC v 1.0 for all three)

And a slightly nuanced follow-up question (I don't want to bombard you with another thread): if I perform ner.manual using the MD model, do I have to train using the MD model as well? You indicate around minute 10 in the YouTube video (https://www.youtube.com/watch?v=59BKHO_xBPA) that you have to use the same pre-trained "blank:en" in order for things to work. Sorry if this is basic, I'm still very new to this world.

Thank you!!

ines · July 29, 2021, 11:26pm

Yes, that's correct. The trained pipelines we provide all use the default tokenization rules included with spaCy that are also available when you create a blank nlp object (e.g. spacy.blank("en")).

If you're using ner.manual, the only thing that matters is that the tokenizer is the same: so you can use blank:en and then train a model based on en_core_web_md. What you couldn't/shouldn't do is train a model using a different tokenizer with different rules, because then you may end up with annotations that don't match the model's tokenization and that it can't learn from.
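As a rough illustration of that mismatch, the sketch below checks whether an annotated span lines up with token boundaries under two different tokenizers. Both tokenizers are toy regex stand-ins written for this example (they are not spaCy's rules), and all names are hypothetical.

```python
import re


def punct_split_offsets(text):
    """Toy tokenizer A: words and punctuation marks are separate tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def whitespace_offsets(text):
    """Toy tokenizer B: split on whitespace only, punctuation stays attached."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def span_is_aligned(offsets, start, end):
    """True if the span starts at a token start and ends at a token end."""
    starts = {s for s, _ in offsets}
    ends = {e for _, e in offsets}
    return start in starts and end in ends


text = "Take ibuprofen, daily."
span = (5, 14)  # "ibuprofen", annotated while tokenizer A was in use

print(span_is_aligned(punct_split_offsets(text), *span))
print(span_is_aligned(whitespace_offsets(text), *span))
```

Under tokenizer B the token is "ibuprofen," with the comma attached, so the annotated span falls inside a token and a model using that tokenizer could never produce it.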
