Training broken after importing training data

Hi.

I’m trying to add training data that I’ve programmatically converted from another source. The conversion produces lines in a JSONL file in this format:

{"text": "The Iowa medical examiner's office said a woman killed by a train at an Ames crossing in March committed suicide.", "spans": [{"start": 9, "end": 25, "label": "JOB"}]}

Adding them to an existing dataset seems to work fine:

(prodigy) eb% prodigy db-in wiki_test auto_training.jsonl

  ✨  Imported 615 annotations for 'wiki_test' to database SQLite
  Added 'accept' answer to 615 annotations
  Session ID: 2019-04-10_15-16-12

But when I try to train on it, it crashes with a traceback:

(prodigy) eb% prodigy ner.batch-train wiki_test en_core_web_lg --output model --label JOB
Using 1 labels: JOB

Loaded model en_core_web_lg
Using 20% of accept/reject examples (898) for evaluation
Traceback (most recent call last):
  File "/misc/pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/misc/pyenv/versions/3.6.8/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/misc/pyenv/versions/prodigy/lib/python3.6/site-packages/prodigy/__main__.py", line 331, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 211, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/misc/pyenv/versions/prodigy/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/misc/pyenv/versions/prodigy/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/misc/pyenv/versions/prodigy/lib/python3.6/site-packages/prodigy/recipes/ner.py", line 521, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 37, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 164, in prodigy.components.preprocess._add_tokens
KeyError: 89

I’m stuck now because I can’t retrain.

Are you on the latest version of Prodigy? In any case, try setting --unsegmented when you run ner.batch-train to prevent it from trying to split your data into sentences (which you probably don’t need anyway).
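
So in your case, that would just mean adding the flag to the command you already ran:

(prodigy) eb% prodigy ner.batch-train wiki_test en_core_web_lg --output model --label JOB --unsegmented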

It’s working again with --unsegmented. Thanks!

One thing though: some of the previous training data (I’ve done some ner.manual and ner.teach sessions already) might contain multi-sentence examples. Will that matter?

PS: I’m using prodigy-1.7.1-cp35.cp36.cp37-cp35m.cp36m.cp37m-linux_x86_64.whl


Glad it works now! This should be fixed in the next version btw.

And no, it shouldn’t matter – as long as the data you’ve annotated and the data you’ve imported are somewhat consistent (and there’s not suddenly a 499,293-word example in there, or something like that).
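
If you want to double-check, something like this will flag any unusually long examples in your JSONL export (just a rough sanity check – the 100-word threshold is arbitrary, adjust to taste):

import json

with open("auto_training.jsonl", encoding="utf8") as f:
    for i, line in enumerate(f):
        eg = json.loads(line)
        n_words = len(eg["text"].split())
        if n_words > 100:  # arbitrary cutoff for "suspiciously long"
            print(i, n_words, eg["text"][:80])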
