Error using synthetic dataset

MistralMistrial · January 4, 2019, 11:24pm

I have managed to create a synthetic dataset, following the example in the Prodigy Support thread "Training NER models with synthetic data sets". Unfortunately, I'm getting an error when running ner.batch-train:

user@user-Syntaxnet:~$ python3 -m prodigy ner.batch-train synthetic_dataset en_core_web_lg --output mymodel --label "DATE,QUANTITY_TERMS" --n-iter 40 --batch-size 8
Using 2 labels: DATE, QUANTITY_TERMS
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/__main__.py", line 259, in <module>
    controller = recipe(args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 253, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/home/user/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/user/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func((args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/prodigy/recipes/ner.py", line 426, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 38, in split_sentences
  File "cython_src/prodigy/components/preprocess.pyx", line 150, in prodigy.components.preprocess._add_tokens
KeyError: 8

This is what my generated data looks like:

{"text":"25 - 26 november 2018","spans":[{"text":"25 - 26 november 2018","start":0,"end":21,"label":"DATE"}],"answer":"accept"}
{"text":"26/30 nov","spans":[{"text":"26/30 nov","start":0,"end":9,"label":"DATE"}],"answer":"accept"}
{"text":"doc-no. 4278573 16/nov/2018 (fri) 17:33 (+0100) lp","spans":[{"text":"16/nov/2018","start":16,"end":27,"label":"DATE"}],"answer":"accept"}
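Before training, records like these can be sanity-checked to confirm that each span's start and end offsets actually select the span text. Below is a minimal sketch using only the standard library. The function name check_span_offsets and the SAMPLE list are hypothetical, chosen for illustration; this is not part of Prodigy, and it only checks character offsets, not token alignment.

```python
import json


def check_span_offsets(jsonl_lines):
    """Return (line_number, span, problem) tuples for spans whose offsets look wrong.

    Hypothetical helper for illustration; it only compares character offsets
    against the record text and knows nothing about tokenization.
    """
    problems = []
    for line_no, line in enumerate(jsonl_lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        text = record["text"]
        for span in record.get("spans", []):
            start, end = span["start"], span["end"]
            if not (0 <= start < end <= len(text)):
                problems.append((line_no, span, "offsets out of range"))
            elif "text" in span and text[start:end] != span["text"]:
                problems.append((line_no, span, "offsets do not match span text"))
    return problems


# The three records quoted in the post above.
SAMPLE = [
    '{"text":"25 - 26 november 2018","spans":[{"text":"25 - 26 november 2018","start":0,"end":21,"label":"DATE"}],"answer":"accept"}',
    '{"text":"26/30 nov","spans":[{"text":"26/30 nov","start":0,"end":9,"label":"DATE"}],"answer":"accept"}',
    '{"text":"doc-no. 4278573 16/nov/2018 (fri) 17:33 (+0100) lp","spans":[{"text":"16/nov/2018","start":16,"end":27,"label":"DATE"}],"answer":"accept"}',
]

if __name__ == "__main__":
    for line_no, span, problem in check_span_offsets(SAMPLE):
        print(f"line {line_no}: {problem}: {span}")
```

To check a whole file, the same function can be given an open file handle, since iterating over a file yields its lines.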

I had no issues importing this data from the generated JSONL file into a dataset. Does ner.batch-train expect the examples to be tokenized? If so, how can I do that after generating the data? (Note: I don't generate the data using Python.)

ines · January 4, 2019, 11:27pm

Hi and sorry about that. Could you try disabling auto sentence-segmentation by setting the --unsegmented flag on ner.batch-train?
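For anyone scripting the rerun, here is a rough sketch that assembles the original command from the first post with --unsegmented appended. The helper name batch_train_command and its parameters are hypothetical; only the recipe name and the flags that appear in this thread are used.

```python
import shlex


def batch_train_command(dataset, model, output, labels, n_iter, batch_size,
                        unsegmented=True):
    """Build the argument list for ner.batch-train (hypothetical helper).

    Uses only the arguments shown in this thread; --unsegmented is appended
    to disable automatic sentence segmentation.
    """
    cmd = [
        "python3", "-m", "prodigy", "ner.batch-train", dataset, model,
        "--output", output,
        "--label", ",".join(labels),
        "--n-iter", str(n_iter),
        "--batch-size", str(batch_size),
    ]
    if unsegmented:
        cmd.append("--unsegmented")
    return cmd


if __name__ == "__main__":
    cmd = batch_train_command(
        "synthetic_dataset", "en_core_web_lg", "mymodel",
        ["DATE", "QUANTITY_TERMS"], n_iter=40, batch_size=8,
    )
    # Print the command for inspection; shlex.join requires Python 3.8+.
    print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) to launch training, assuming Prodigy is installed in the same environment.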

I think this might be related to the same issue as this one, which will be fixed in the upcoming version:

MistralMistrial · January 4, 2019, 11:33pm

Yes, that did it. Thanks for the lightning-fast response!
