I am training an NER model to detect two custom entity types, `JURISDICTION` and `SERVICE_CONSUMER`. I have training data in a dataset called `entity.dataset` with annotations for both of these types. I run the following Prodigy command:

```
prodigy ner.batch-train entity.dataset en_core_web_lg --n-iter 5 --output-model model --label JURISDICTION
```

and get this crash stack before I see output for the first iteration:
```
['U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-QUANTITY', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'B-ORG', 'L-ORG', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-PRODUCT', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-EVENT', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART']
Traceback (most recent call last):
  File "/usr/lib64/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/prodigy/__main__.py", line 242, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/prodigy/recipes/ner.py", line 401, in batch_train
    drop=dropout, beam_width=beam_width)
  File "cython_src/prodigy/models/ner.pyx", line 304, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src/prodigy/models/ner.pyx", line 365, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 359, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 360, in prodigy.models.ner.EntityRecognizer._update
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/spacy/language.py", line 407, in update
    proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
  File "nn_parser.pyx", line 558, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 676, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 119, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 178, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: 'B-SERVICE_CONSUMER'
```
I'm not worried about the `WORK_OF_ART` business; I assume that's just a side effect of catastrophic forgetting.

Is the problem here that my training data contains both `JURISDICTION` and `SERVICE_CONSUMER` annotations, but I only pass `JURISDICTION` to `--label`? If my dataset contained only `JURISDICTION` annotations, would this work?
(I know the usual answer to this kind of question is "try it and see," but unfortunately I'm debugging these problems for a client that can't give me access to their system, so every attempt at a slightly different set of command-line options requires emailing somebody there. There's usually a 24-hour turnaround, so I want to understand a problem very thoroughly before I suggest a solution.)
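In case the fix turns out to be "train on only one label at a time," here is a minimal sketch of how I would strip the `SERVICE_CONSUMER` spans from an exported copy of the dataset before re-importing it. This assumes the examples are in Prodigy's usual JSONL shape, where each record may carry a `"spans"` list of `{"start", "end", "label"}` dicts; the function name and the sample text are my own inventions for illustration:

```python
import json

def keep_only_labels(examples, keep):
    """Return copies of the examples with any annotated span whose
    label is not in `keep` removed. Records without a "spans" key
    are passed through with an empty spans list."""
    kept = set(keep)
    filtered = []
    for eg in examples:
        eg = dict(eg)  # shallow copy so the input records are untouched
        eg["spans"] = [s for s in eg.get("spans", []) if s["label"] in kept]
        filtered.append(eg)
    return filtered

# Hypothetical example record with one span of each custom type.
examples = [
    {"text": "Acme Corp serves Ohio.",
     "spans": [{"start": 17, "end": 21, "label": "JURISDICTION"},
               {"start": 0, "end": 9, "label": "SERVICE_CONSUMER"}]},
]
out = keep_only_labels(examples, ["JURISDICTION"])
print(json.dumps(out, indent=2))
```

The idea would be to run this over a `db-out` export and load the filtered file into a fresh dataset, so the trainer never sees a `SERVICE_CONSUMER` span at all.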