Can't import spacy model from prodigy train recipe output

Hello,

I'm trying to import a model I trained using the following command:
prodigy train textcat clinical_order_types en_core_web_lg --textcat-exclusive

Then when I try loading the model:
mod = spacy.load(model_location)
I'm getting: KeyError: 'PUNCTSIDE_FIN'

In the past, I've been able to run this type of construct in other prodigy projects in an earlier version of prodigy (1.8?).
The prodigy train command appears to be working, and these are the contents of the output directory:
os.listdir(model_location)
['textcat', 'ner', 'tagger', 'tokenizer', 'meta.json', 'vocab', 'parser']
Given that this is a text classification model, I don't know why there is a directory called "ner".

I read this issue about a similar error with spacy.load() (explosion/spaCy Issue #4945 on GitHub, "nlp=spacy.load('en_core_web_sm') KeyError: 'PUNCTSIDE_FIN'"), but it doesn't seem to apply.

I'm using linux and have
prodigy 1.9.5
spacy 2.2.3
python 3.8

Thanks a lot,

JoAnn

Here are the contents of the meta.json produced by prodigy train:

{"lang":"en","name":"core_web_lg","license":"MIT","author":"Explosion","url":"https://explosion.ai","email":"contact@explosion.ai","description":"English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.","sources":[{"name":"OntoNotes 5","url":"https://catalog.ldc.upenn.edu/LDC2013T19","license":"commercial (licensed by Explosion)"},{"name":"Common Crawl"}],"pipeline":["tagger","parser","ner","textcat"],"version":"2.2.5","spacy_version":">=2.2.2","parent_package":"spacy","accuracy":{"las":90.1734441725,"uas":92.0132337105,"token_acc":99.7579930934,"tags_acc":97.2200800054,"ents_f":86.5464321721,"ents_p":86.7358967163,"ents_r":86.3577935506},"speed":{"cpu":6257.754029418,"gpu":null,"nwords":291314},"labels":{"tagger":["$","''",",","-LRB-","-RRB-",".",":","ADD","AFX","CC","CD","DT","EX","FW","HYPH","IN","JJ","JJR","JJS","LS","MD","NFP","NN","NNP","NNPS","NNS","PDT","POS","PRP","PRP$","RB","RBR","RBS","RP","SYM","TO","UH","VB","VBD","VBG","VBN","VBP","VBZ","WDT","WP","WP$","WRB","XX","_SP","``"],"parser":["ROOT","acl","acomp","advcl","advmod","agent","amod","appos","attr","aux","auxpass","case","cc","ccomp","compound","conj","csubj","csubjpass","dative","dep","det","dobj","expl","intj","mark","meta","neg","nmod","npadvmod","nsubj","nsubjpass","nummod","oprd","parataxis","pcomp","pobj","poss","preconj","predet","prep","prt","punct","quantmod","relcl","xcomp"],"ner":["CARDINAL","DATE","EVENT","FAC","GPE","LANGUAGE","LAW","LOC","MONEY","NORP","ORDINAL","ORG","PERCENT","PERSON","PRODUCT","QUANTITY","TIME","WORK_OF_ART"],"textcat":["vent_stop","iabp","vent","transfuse_blood_products","noninvasive_vent","other","chest_tube_suction"]},"vectors":{"width":300,"vectors":684831,"keys":684830,"name":"en_core_web_lg.vectors"},"factories":{"tagger":"tagger","parser":"parser","ner":"ner","textcat":"textcat"}}

Hi! It looks like the spaCy version used by Prodigy and the spaCy version you're loading your model with are incompatible – you can run pip list in both environments to check. If one environment is running v2.1.x and the other one is running v2.2.x, this would explain what's going on.
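
If it helps, here is a minimal sketch for doing the same check from Python instead of pip list. It assumes the model directory contains a meta.json with a "spacy_version" entry like the one posted above. The helper names read_required_spacy and same_minor are made up for this example and are not part of spaCy or Prodigy, and the comparison is deliberately coarse: it only looks at the major and minor version numbers.

```python
import json
import re
from pathlib import Path


def read_required_spacy(model_dir):
    """Return the "spacy_version" string from the model's meta.json ("" if absent)."""
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    return meta.get("spacy_version", "")


def same_minor(installed, required):
    """Coarse check: does the installed version share major.minor with the requirement?

    installed is a plain version such as "2.2.3";
    required is a requirement string such as ">=2.2.2".
    """
    found = re.search(r"(\d+)\.(\d+)", required)
    if found is None:
        return False
    return installed.split(".")[:2] == [found.group(1), found.group(2)]
```

In the environment where you load the model, you could then compare spacy.__version__ against read_required_spacy(model_location) by passing both to same_minor. If it returns False, the two environments are most likely on different spaCy minor versions.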

Awesome. I didn't realize I was loading my spacy model in a different environment. Thanks so much.

I'm getting a similar error but I don't think it's for the same reason.

I annotated data using my own custom labels

prodigy ner.manual ner_news_headlines blank:en non_annotated_news_headlines.jsonl --label MY_PERSON,MY_ORG,MY_LOCATION

Then I trained a custom model

prodigy train ner ner_news_headlines blank:en --output "C:\Users\my_username\desktop"

I believe the above step stored the trained model as a file called "my_test_model" (originally "tokenizer", but I renamed it).
I wanted to use the newly trained model (stored in the file "my_test_model") to annotate more data, so I used the command

prodigy ner.correct ner_news_headlines my_test_model non_annotated_news_headlines.jsonl

However, I got the following error.

Any idea why I'm getting this error? Am I using the recipes incorrectly?

The directory specified as --output is expected to be the directory to save the model data to – for example, a directory my_test_model. When you specified C:\Users\my_username\desktop, all the model data was written to your desktop – including multiple directories for the components, the meta.json etc.

One of the directories is tokenizer, so it sounds like you renamed that and accidentally ended up only using the tokenizer directory instead of the whole model data? That's also why spaCy is confused, because it can't find any of the files it expects.

There should be multiple directories and files created on your desktop, but it's probably a bit messy now – so I'd recommend just re-running the training and specifying an empty directory as the output. This is where all the files will be saved (and you'll also be able to see all the files and directories that are created).
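
To sanity-check an output directory before passing it to a recipe, something like the sketch below could help. It assumes the spaCy v2 layout shown in the directory listing earlier in this thread (a meta.json, plus "vocab", "tokenizer" and one entry per pipeline component named in meta.json). The function name missing_model_parts is made up for this example and is not a spaCy or Prodigy API.

```python
import json
from pathlib import Path


def missing_model_parts(model_dir):
    """List the entries a saved model directory appears to be missing.

    Assumes the spaCy v2 layout: meta.json, "vocab", "tokenizer",
    and one entry per component in meta.json's "pipeline" list.
    """
    model_dir = Path(model_dir)
    meta_path = model_dir / "meta.json"
    if not meta_path.is_file():
        # Without meta.json the pipeline components can't be determined.
        return ["meta.json"]
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    expected = ["vocab", "tokenizer"] + list(meta.get("pipeline", []))
    return [name for name in expected if not (model_dir / name).exists()]
```

An empty list means the directory at least looks like a complete model. If you point it at the renamed tokenizer file, you'd get ["meta.json"] back, which matches what spaCy is complaining about.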

Thanks, now I got it!