I have a problem with converting a trained NER model into a loadable module. I believe I'm following the steps laid out in the Get Started, First Steps and the tutorial video on training an insult recogniser.
The following is somewhat long, mainly because I include all the output I get so you can see exactly what's happening. (NB: the Prodigy support page seems to strip tabs from pasted code, so if the for and if blocks below appear unindented, please assume the indentation is there in my actual scripts.)
First, I created a dataset:
prodigy dataset eng_model "English model, version 1.0" --author MN
✨ Successfully added 'eng_model' to database SQLite.
Then I ran ner.teach with the en_core_web_sm model on my own training data (for entity type 'ORG' only, just to start somewhere) and saved the annotations in the eng_model dataset:
prodigy ner.teach eng_model en_core_web_sm traindata.txt --label ORG
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
^C('\nSaved 1415 annotations to database', 'SQLite')
('Dataset:', 'eng_model')
('Session ID:', '2017-09-25_19-30-47', '\n')
This seemed to work out just fine, as the output above hopefully shows. I went through something like 1300 examples before exhausting my training data, which took me slightly less than an hour (you've made this so easy, thanks for that!).
I then ran the batch-train script to train the model:
prodigy ner.batch-train eng_model en_core_web_sm --output /tmp/model --eval-split 0.5 --label ORG
Loaded model en_core_web_sm
Using 50% of examples (597) for evaluation
Using 100% of remaining examples (601) for training
Dropout: 0.2 Batch size: 32 Iterations: 10
BEFORE 0.491
Correct 26
Incorrect 27
Entities 477
Unknown 205
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 0.610 42 11 483 0 0.792
02 0.339 48 5 448 0 0.906
03 0.203 47 6 440 0 0.887
04 0.138 46 7 449 0 0.868
05 0.092 48 5 429 0 0.906
06 0.062 48 5 437 0 0.906
07 0.043 49 4 437 0 0.925
08 0.033 46 7 441 0 0.868
09 0.026 48 5 425 0 0.906
10 0.014 47 6 437 0 0.887
Correct 49
Incorrect 4
Baseline 0.491
Accuracy 0.925
Model: /tmp/model
Training data: /tmp/model/training.jsonl
Evaluation data: /tmp/model/evaluation.jsonl
So, as far as I can see, everything worked out, and the updated model was placed in /tmp/model.
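As a quick sanity check on the numbers above: the ACCURACY column appears to be RIGHT / (RIGHT + WRONG), which would mean the score is based on 53 decisions rather than on all 597 evaluation examples. That is only my reading of the output, not something I found in the Prodigy documentation, and the helper name below is made up for illustration.

```python
def accuracy(right, wrong):
    """Fraction of correct decisions, rounded to three decimals like the log."""
    total = right + wrong
    if total == 0:
        return 0.0
    return round(right / total, 3)


# (right, wrong) counts copied from the batch-train output above
before = accuracy(26, 27)   # the "BEFORE" / "Baseline" row
best = accuracy(49, 4)      # the best iteration (07)
print(before, best)
```

If this reading is right, the improvement is measured on a fairly small number of decisions, so I wouldn't read too much into the third decimal.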
To test if my updated model actually made a difference I wrote a small script (derived from one of the examples on the prodigy website):
import spacy
import en_core_web_sm
text = ''' ...(not shown)... '''
# Print entity labels and text for the untrained model:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print("\nEntities found before training:")
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.label_, ent.text)
# Load the trained model:
nlp = spacy.load('/tmp/model')
doc = nlp(text)
# Print entity labels and text
print("\nEntities found after training:")
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.label_, ent.text)
Run on the same input text, en_core_web_sm and the updated model in /tmp/model found slightly different organisations, and the updated model did slightly better, which is obviously what I'd like it to do. I take this to mean that the two models really are different (and that the updated one is a little better, at least for my purposes). All good so far (I think).
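To make "slightly different" a bit more concrete, the two outputs could be compared as sets rather than by eye. This is a minimal sketch; the function name and the example entities are made up for illustration and are not output from my models.

```python
def compare_entities(before, after):
    """Split two lists of (label, text) pairs into only-before, only-after, shared."""
    before_set, after_set = set(before), set(after)
    return (
        sorted(before_set - after_set),   # found only by the original model
        sorted(after_set - before_set),   # found only by the updated model
        sorted(before_set & after_set),   # found by both
    )


# Made-up example data, NOT taken from my actual text
ents_before = [("ORG", "Acme Corp"), ("ORG", "Monday")]
ents_after = [("ORG", "Acme Corp"), ("ORG", "Globex")]

only_before, only_after, shared = compare_entities(ents_before, ents_after)
print("Only before:", only_before)
print("Only after: ", only_after)
print("Shared:     ", shared)
```

In the script above, the input lists would be built with something like `[(ent.label_, ent.text) for ent in doc.ents if ent.label_ == 'ORG']` for each of the two models.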
My problem arises when I try to package the trained model as a loadable spaCy module. I believe I'm following the guidelines to the letter (please tell me if I'm not):
spacy package /tmp/model /tmp --create-meta
Generating meta.json
Enter the package settings for your model.
Model language (default: en): en
Model name (default: model): model_TEST
Model version (default: 0.0.0):
Required spaCy version (default: >=2.0.0a14,<3.0.0):
Model description: 'Test model'
Author: MN
Author email:
Author website:
License (default: CC BY-NC 3.0):
Enter your model's pipeline components
If set to 'True', the default pipeline is used. If set to 'False', the
pipeline will be disabled. Components should be specified as a
comma-separated list of component names, e.g. tensorizer, tagger,
parser, ner. For more information, see the docs on processing pipelines.
Pipeline components (default: True):
Successfully created package 'en_model_TEST-0.0.0'
/tmp/en_model_TEST-0.0.0
To build the package, run `python setup.py sdist` in this directory.
I cd'ed to the /tmp/en_model_TEST-0.0.0 directory and from there ran the setup:
python setup.py sdist
running sdist
running egg_info
creating en_model_TEST.egg-info
writing dependency_links to en_model_TEST.egg-info/dependency_links.txt
writing requirements to en_model_TEST.egg-info/requires.txt
writing top-level names to en_model_TEST.egg-info/top_level.txt
writing en_model_TEST.egg-info/PKG-INFO
writing manifest file 'en_model_TEST.egg-info/SOURCES.txt'
reading manifest file 'en_model_TEST.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'en_model_TEST.egg-info/SOURCES.txt'
warning: sdist: standard file not found: should have one of README, README.rst, README.txt, README.md
running check
warning: check: missing required meta-data: url
warning: check: missing meta-data: if 'author' supplied, 'author_email' must be supplied too
creating en_model_TEST-0.0.0
creating en_model_TEST-0.0.0/en_model_TEST
creating en_model_TEST-0.0.0/en_model_TEST.egg-info
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tagger
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tensorizer
creating en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/vocab
copying files to en_model_TEST-0.0.0...
copying MANIFEST.in -> en_model_TEST-0.0.0
copying meta.json -> en_model_TEST-0.0.0
copying setup.py -> en_model_TEST-0.0.0
copying en_model_TEST/__init__.py -> en_model_TEST-0.0.0/en_model_TEST
copying en_model_TEST/meta.json -> en_model_TEST-0.0.0/en_model_TEST
copying en_model_TEST.egg-info/PKG-INFO -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST.egg-info/SOURCES.txt -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST.egg-info/dependency_links.txt -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST.egg-info/not-zip-safe -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST.egg-info/requires.txt -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST.egg-info/top_level.txt -> en_model_TEST-0.0.0/en_model_TEST.egg-info
copying en_model_TEST/en_model_TEST-0.0.0/evaluation.jsonl -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0
copying en_model_TEST/en_model_TEST-0.0.0/meta.json -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0
copying en_model_TEST/en_model_TEST-0.0.0/tokenizer -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0
copying en_model_TEST/en_model_TEST-0.0.0/training.jsonl -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0
copying en_model_TEST/en_model_TEST-0.0.0/ner/cfg -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
copying en_model_TEST/en_model_TEST-0.0.0/ner/lower_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
copying en_model_TEST/en_model_TEST-0.0.0/ner/moves -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
copying en_model_TEST/en_model_TEST-0.0.0/ner/tok2vec_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
copying en_model_TEST/en_model_TEST-0.0.0/ner/upper_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/ner
copying en_model_TEST/en_model_TEST-0.0.0/parser/cfg -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
copying en_model_TEST/en_model_TEST-0.0.0/parser/lower_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
copying en_model_TEST/en_model_TEST-0.0.0/parser/moves -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
copying en_model_TEST/en_model_TEST-0.0.0/parser/tok2vec_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
copying en_model_TEST/en_model_TEST-0.0.0/parser/upper_model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/parser
copying en_model_TEST/en_model_TEST-0.0.0/tagger/cfg -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tagger
copying en_model_TEST/en_model_TEST-0.0.0/tagger/model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tagger
copying en_model_TEST/en_model_TEST-0.0.0/tagger/tag_map -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tagger
copying en_model_TEST/en_model_TEST-0.0.0/tensorizer/cfg -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tensorizer
copying en_model_TEST/en_model_TEST-0.0.0/tensorizer/model -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/tensorizer
copying en_model_TEST/en_model_TEST-0.0.0/vocab/keys -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/vocab
copying en_model_TEST/en_model_TEST-0.0.0/vocab/lexemes.bin -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/vocab
copying en_model_TEST/en_model_TEST-0.0.0/vocab/strings.json -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/vocab
copying en_model_TEST/en_model_TEST-0.0.0/vocab/vectors -> en_model_TEST-0.0.0/en_model_TEST/en_model_TEST-0.0.0/vocab
Writing en_model_TEST-0.0.0/setup.cfg
creating dist
Creating tar archive
removing 'en_model_TEST-0.0.0' (and everything under it)
So, I get warnings about a missing README, url and author email, but I find it hard to believe that these are the problem. Also, when I check the dist directory, the archive is there:
ls dist
en_model_TEST-0.0.0.tar.gz
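To double-check that the archive really contains the model files and not just the package skeleton, its contents can be listed with the standard library. This is a small sketch under the assumption that it is run from the /tmp/en_model_TEST-0.0.0 directory; the helper names are mine.

```python
import os
import tarfile


def archive_members(path):
    """Return the names of all entries in a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


def has_member_ending_with(names, suffix):
    """True if any archive entry ends with the given suffix."""
    return any(name.endswith(suffix) for name in names)


if __name__ == "__main__":
    # Path taken from the `ls dist` output above
    path = "dist/en_model_TEST-0.0.0.tar.gz"
    if os.path.exists(path):
        names = archive_members(path)
        print(len(names), "entries")
        print("meta.json present:", has_member_ending_with(names, "meta.json"))
        print("ner/cfg present:", has_member_ending_with(names, "ner/cfg"))
```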
I then install the model with pip:
pip install dist/en_model_TEST-0.0.0.tar.gz
Processing ./dist/en_model_TEST-0.0.0.tar.gz
Requirement already satisfied: spacy-nightly<3.0.0,>=2.0.0a14 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from en-model-TEST==0.0.0)
Requirement already satisfied: msgpack-python in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: cymem<1.32,>=1.30 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: ujson>=1.35 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: regex==2017.4.5 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: plac<1.0.0,>=0.9.6 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: requests<3.0.0,>=2.13.0 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: dill<0.3,>=0.2 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: ftfy<5.0.0,>=4.4.2 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: preshed<2.0.0,>=1.0.0 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: msgpack-numpy in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: murmurhash<0.29,>=0.28 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: thinc<6.9.0,>=6.8.1 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: pathlib in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: numpy>=1.7 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: six in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from requests<3.0.0,>=2.13.0->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: idna<2.7,>=2.5 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from requests<3.0.0,>=2.13.0->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from requests<3.0.0,>=2.13.0->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: certifi>=2017.4.17 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from requests<3.0.0,>=2.13.0->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: html5lib in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from ftfy<5.0.0,>=4.4.2->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: wcwidth in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from ftfy<5.0.0,>=4.4.2->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: wrapt in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from thinc<6.9.0,>=6.8.1->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: tqdm<5.0.0,>=4.10.0 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from thinc<6.9.0,>=6.8.1->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: cytoolz<0.9,>=0.8 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from thinc<6.9.0,>=6.8.1->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: termcolor in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from thinc<6.9.0,>=6.8.1->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: setuptools>=18.5 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from html5lib->ftfy<5.0.0,>=4.4.2->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: webencodings in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from html5lib->ftfy<5.0.0,>=4.4.2->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Requirement already satisfied: toolz>=0.8.0 in /home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages (from cytoolz<0.9,>=0.8->thinc<6.9.0,>=6.8.1->spacy-nightly<3.0.0,>=2.0.0a14->en-model-TEST==0.0.0)
Building wheels for collected packages: en-model-TEST
Running setup.py bdist_wheel for en-model-TEST ... done
Stored in directory: /home/mede/.cache/pip/wheels/aa/97/e2/468fe0e132d693852ddf090467827a936060e5c1d959a20b1f
Successfully built en-model-TEST
Installing collected packages: en-model-TEST
Successfully installed en-model-TEST-0.0.0
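One detail in the output above: pip calls the package en-model-TEST, while the module I import is en_model_TEST. Under the name normalisation described in PEP 503 (lowercase, with runs of "-", "_" and "." collapsed to a single "-") these are the same name, but a plain string comparison treats them as different. Whether spaCy's model lookup is sensitive to this is an assumption on my part that I haven't verified; the sketch below only illustrates the mismatch, and the list of installed names is illustrative.

```python
import re


def normalize(name):
    """Normalise a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_matches(target, installed_names):
    """Return installed names that equal `target` after normalisation."""
    wanted = normalize(target)
    return [name for name in installed_names if normalize(name) == wanted]


# Illustrative list; the first entry is the name as pip reports it above
installed = ["en-model-TEST", "en-core-web-sm"]

print("en_model_TEST" in installed)               # exact string comparison
print(find_matches("en_model_TEST", installed))   # comparison after normalisation
```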
So, no warnings or errors there. Gives one hope, doesn't it? To test if the module is now loadable I use a slightly modified version of the script mentioned above:
import spacy
import en_core_web_sm
text = ''' ...(not shown)... '''
# Print entity labels and text for the untrained model:
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print("\nEntities found before training:")
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.label_, ent.text)
# Load the trained model:
import en_model_TEST
nlp = spacy.load('en_model_TEST') # doesn't work, nor does en_model_TEST.load()
#nlp = spacy.load('en_model_TEST_0.0.0') # doesn't work either, nor does en_model_TEST_0.0.0.load()
#nlp = spacy.load('model_TEST') # doesn't work either, nor does model_TEST.load()
doc = nlp(text)
# Print entity labels and text
print("\n-------------------------------------------\nEntities found after training:")
for ent in doc.ents:
    if ent.label_ == 'ORG':
        print(ent.label_, ent.text)
This fails at the line "nlp = spacy.load('en_model_TEST')", and I get the following output:
Traceback (most recent call last):
File "TUTORIAL_use_trained_model.py", line 126, in <module>
nlp = spacy.load('en_model_TEST')
File "/home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages/spacy/__init__.py", line 13, in load
return util.load_model(name, **overrides)
File "/home/mede/Desktop/_OVERFOERES/CluedIn/NER_extraction/spacyNER/prodigy_installation/virtualenv_prodigy/lib/python3.5/site-packages/spacy/util.py", line 110, in load_model
raise IOError("Can't find model '%s'" % name)
OSError: Can't find model 'en_model_TEST'
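The traceback points at the spacy.load() line rather than at the import line before it, so one thing that might help narrow this down is checking where (and whether) Python itself can locate the installed module, independently of spaCy. This is only a diagnostic sketch using the standard library; the helper name is made up.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would be imported from, or None.

    None is returned if the module cannot be found (and may also be
    returned for namespace packages, which have no single origin file).
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# Names taken from the script above
for name in ("en_model_TEST", "en_core_web_sm"):
    print(name, "->", locate_module(name))
```

If both names resolve to a path inside the same virtualenv, the package itself would seem to be installed correctly, and the question becomes how spaCy maps the string passed to spacy.load() onto installed packages.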
Shouldn't spaCy be able to find the trained model at this point? Is there a further step after the pip install that makes the model discoverable to spaCy / Python? Have I misunderstood the procedure on some basic level? Please let me know what I've done wrong. Thanks in advance!