Error using synthetic dataset

I have managed to create a synthetic data set, following the example in the thread Prodigy Support Training NER models with synthetic data sets. Unfortunately, I'm getting an error while running ner.batch-train:

user@user-Syntaxnet:~$ python3 -m prodigy ner.batch-train synthetic_dataset en_core_web_lg --output mymodel --label "DATE,QUANTITY_TERMS" --n-iter 40 --batch-size 8
Using 2 labels: DATE, QUANTITY_TERMS
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/user/.local/lib/python3.6/site-packages/plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "/home/user/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 150, in prodigy.components.preprocess._add_tokens
KeyError: 8

This is what my generated data looks like:

{"text":"25 - 26 november 2018","spans":[{"text":"25 - 26 november 2018","start":0,"end":21,"label":"DATE"}],"answer":"accept"}
{"text":"26/30 nov","spans":[{"text":"26/30 nov","start":0,"end":9,"label":"DATE"}],"answer":"accept"}
{"text":"doc-no. 4278573 16/nov/2018 (fri) 17:33 (+0100) lp","spans":[{"text":"16/nov/2018","start":16,"end":27,"label":"DATE"}],"answer":"accept"}
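
For records in this format, one quick sanity check is to confirm that each span's start and end offsets really select the span's text. The following is only a minimal sketch: the helper name check_spans is made up for illustration and is not part of Prodigy, and it assumes one JSON object per line with the keys shown above. It verifies character offsets only; it cannot tell whether those offsets line up with the tokenizer's token boundaries.

```python
import json


def check_spans(jsonl_text):
    """Return (line_number, problem) pairs for spans whose offsets look wrong.

    Hypothetical helper for illustration; not a Prodigy function.
    """
    problems = []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        for span in record.get("spans", []):
            start, end = span["start"], span["end"]
            if not (0 <= start < end <= len(text)):
                problems.append((line_no, "offsets out of range: %d-%d" % (start, end)))
            elif "text" in span and text[start:end] != span["text"]:
                problems.append(
                    (line_no, "offsets select %r, expected %r" % (text[start:end], span["text"]))
                )
    return problems


# The three records from the post above.
SAMPLE = "\n".join([
    '{"text":"25 - 26 november 2018","spans":[{"text":"25 - 26 november 2018","start":0,"end":21,"label":"DATE"}],"answer":"accept"}',
    '{"text":"26/30 nov","spans":[{"text":"26/30 nov","start":0,"end":9,"label":"DATE"}],"answer":"accept"}',
    '{"text":"doc-no. 4278573 16/nov/2018 (fri) 17:33 (+0100) lp","spans":[{"text":"16/nov/2018","start":16,"end":27,"label":"DATE"}],"answer":"accept"}',
])

print(check_spans(SAMPLE))
```

An empty list means every span's offsets match its text, which helps rule out the generator as the source of the problem.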

I had no issues importing this data from the generated JSONL file into a dataset. Does ner.batch-train expect the examples to be tokenized? If so, how can I do that after generating the data? (NB: I don't generate the data using Python.)

Hi and sorry about that. Could you try disabling auto sentence-segmentation by setting the --unsegmented flag on ner.batch-train?
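
As a rough sketch, here is the original command with the flag appended, assembled in Python so it can be printed or passed to subprocess. The function name build_batch_train_command is hypothetical; the arguments are the ones from the command in the question, plus --unsegmented.

```python
import shlex


def build_batch_train_command(dataset, model, output, labels, n_iter, batch_size,
                              unsegmented=True):
    """Assemble the ner.batch-train argument list (hypothetical helper)."""
    argv = [
        "python3", "-m", "prodigy", "ner.batch-train", dataset, model,
        "--output", output,
        "--label", ",".join(labels),
        "--n-iter", str(n_iter),
        "--batch-size", str(batch_size),
    ]
    if unsegmented:
        # Skip automatic sentence segmentation, as suggested above.
        argv.append("--unsegmented")
    return argv


argv = build_batch_train_command(
    "synthetic_dataset", "en_core_web_lg", "mymodel",
    ["DATE", "QUANTITY_TERMS"], n_iter=40, batch_size=8,
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in argv))
```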

I think this might be related to the same issue as this one, which will be fixed in the upcoming version:

Yes, that did it. Thanks for the lightning-fast response!
