ner.batch-train results in KeyError

ner
to-be-released

(Matt Upson) #1

Hi and Happy New Year!

I’m having an issue running ner.batch-train to train a model on a dataset I annotated with Prodigy. I get the following traceback:

$ prodigy ner.batch-train ner_train_0.1.5 en_core_web_sm
Traceback (most recent call last):
  File "/home/matthew/.pyenv/versions/3.6.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/matthew/.pyenv/versions/3.6.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 143, in prodigy.components.preprocess._add_tokens
KeyError: 73

The annotations contain a mix of 3 standard NER entities from spaCy, and one new one that I have labelled from scratch. Could this be a tokenisation error?
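
One way to test the tokenisation hypothesis is to look for annotated spans whose character offsets do not line up with token boundaries. The sketch below is only an assumption about the cause, not a description of what Prodigy's `_add_tokens` does internally. `token_boundaries` and `misaligned_spans` are hypothetical helper names, and the regex tokenizer is a stand-in for the model's real tokenizer (with spaCy you would collect `token.idx` and `token.idx + len(token)` instead). The example uses the usual `text` / `spans` layout with `start`, `end` and `label` keys.

```python
import re


def token_boundaries(text):
    """Collect start and end character offsets of tokens.

    Uses a simple regex tokenizer as a stand-in (assumption): runs of word
    characters are tokens, and each punctuation character is its own token.
    """
    starts, ends = set(), set()
    for match in re.finditer(r"\w+|[^\w\s]", text):
        starts.add(match.start())
        ends.add(match.end())
    return starts, ends


def misaligned_spans(example):
    """Return the spans whose start or end offset is not a token boundary."""
    starts, ends = token_boundaries(example["text"])
    return [
        span
        for span in example.get("spans", [])
        if span["start"] not in starts or span["end"] not in ends
    ]


# Hypothetical annotated example, not taken from the dataset in this thread.
example = {
    "text": "Apple opened an office in London.",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},    # covers "Apple" exactly
        {"start": 26, "end": 31, "label": "GPE"},  # covers "Londo", one char short
    ],
}

for span in misaligned_spans(example):
    print(span, repr(example["text"][span["start"]:span["end"]]))
```

Running this over every example in the exported dataset would list any spans that cut through a token, which are the likeliest candidates if the error really is tokenisation-related.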


Info about spaCy

spaCy version      2.0.18         
prodigy version    1.6.1
Location           /home/matthew/.virtualenvs/prodigy_utils/lib/python3.6/site-packages/spacy
Platform           Linux-4.15.0-42-generic-x86_64-with-debian-buster-sid
Python version     3.6.6

(Matt Upson) #2

Ahh, I just realised that adding --unsegmented to the command solves the problem.


(Ines Montani) #3

Thanks for the update and sorry about that – I’m pretty sure we already fixed this problem for the upcoming release!