Hi and Happy New Year!
I’m having an issue running ner.batch-train to train a model on a dataset I annotated with Prodigy. I get the following traceback:
$ prodigy ner.batch-train ner_train_0.1.5 en_core_web_sm
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.6.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/matthew/.pyenv/versions/3.6.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 73
The annotations contain a mix of three of spaCy’s standard NER labels, plus one new label that I annotated from scratch. Could this be a tokenisation error?
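In case it helps narrow things down, here’s a quick check I was planning to run over my examples to see whether any annotated span offsets misalign with spaCy’s tokeniser. This is just a sketch: find_misaligned is my own helper name, and I’d feed it the parsed JSONL from prodigy db-out rather than the inline example shown here.

```python
import spacy


def find_misaligned(nlp, examples):
    """Return (example_index, span) pairs whose character offsets do not
    fall on token boundaries under nlp's tokeniser.

    Doc.char_span returns None when start/end cut through a token, which
    is the usual sign of a tokenisation mismatch."""
    bad = []
    for i, eg in enumerate(examples):
        doc = nlp.make_doc(eg["text"])  # tokenise only, no pipeline
        for span in eg.get("spans", []):
            if doc.char_span(span["start"], span["end"]) is None:
                bad.append((i, span))
    return bad


# Quick smoke test with the blank English tokeniser:
# end=6 cuts the token "York" in half, so the span is flagged.
nlp = spacy.blank("en")
eg = {"text": "New York City", "spans": [{"start": 0, "end": 6}]}
print(find_misaligned(nlp, [eg]))
```

If this flags any spans in my dataset, I’d assume those are the examples tripping up split_sentences.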
Info about spaCy
spaCy version 2.0.18
prodigy version 1.6.1
Location /home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/spacy
Platform Linux-4.15.0-42-generic-x86_64-with-debian-buster-sid
Python version 3.6.6