When does a custom entity label cause a crash in training?

I am training an NER model to detect two custom entity types, JURISDICTION and SERVICE_CONSUMER. I have training data in a dataset called entity.dataset with annotations for both of these types. I run the following Prodigy command:

    prodigy ner.batch-train entity.dataset en_core_web_lg --n-iter 5 --output-model model --label JURISDICTION

and get the following stack trace before I see any output for the first iteration:

['U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-QUANTITY', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'B-ORG', 'L-ORG', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-PRODUCT', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-EVENT', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART', 'U-WORK_OF_ART']
Traceback (most recent call last):
  File "/usr/lib64/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/prodigy/__main__.py", line 242, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/prodigy/recipes/ner.py", line 401, in batch_train
    drop=dropout, beam_width=beam_width)
  File "cython_src/prodigy/models/ner.pyx", line 304, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src/prodigy/models/ner.pyx", line 365, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 359, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 360, in prodigy.models.ner.EntityRecognizer._update
  File "/pyenv/environments/ecms/lib/python3.5/site-packages/spacy/language.py", line 407, in update
    proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
  File "nn_parser.pyx", line 558, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 676, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 119, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 178, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: 'B-SERVICE_CONSUMER'

I’m not worried about the WORK_OF_ART business. I assume that’s just a side effect of catastrophic forgetting.

Is the problem here that I have both JURISDICTION and SERVICE_CONSUMER in my training data, but only JURISDICTION as an argument to --label? If I had only JURISDICTION annotations in my dataset, would this work?

(I know the usual answer to this kind of question is “try it and see”, but unfortunately I’m debugging these problems for a client that can’t give me access to their system, so every attempt at a slightly different set of command-line options has to be emailed to somebody on their side. There’s usually a 24-hour turnaround for this, so I want to understand a problem very thoroughly before I suggest a solution.)
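
In the meantime, here’s a sketch of what I was planning to ask them to run to confirm which labels are actually stored in the dataset. It uses the Prodigy database API; I haven’t been able to test it on their system, so treat the exact calls as an assumption on my part:

    # Untested sketch: count the entity labels stored in entity.dataset,
    # assuming the standard Prodigy database API (prodigy.components.db).
    from collections import Counter
    from prodigy.components.db import connect

    db = connect()  # picks up the database settings from prodigy.json
    counts = Counter(
        span["label"]
        for eg in db.get_dataset("entity.dataset")
        for span in eg.get("spans", [])
    )
    print(counts)  # I expect to see both JURISDICTION and SERVICE_CONSUMER here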

Hmm. If you have a look at the source of prodigy.recipes.ner.batch_train, you should see these lines:

    # make sure all labels are present in the model
    for eg in examples:
        for span in eg.get('spans', []):
            ner.add_label(span['label'])
    for l in label:
        ner.add_label(l)

Are you on the latest version of Prodigy? I think we fixed a bug around this a version or two ago.
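
If upgrading isn’t straightforward on the client’s machine, one possible workaround (just a sketch of the idea, not something I’ve run against your exact setup) is to pre-register both labels on the base model with spaCy and point ner.batch-train at the saved copy:

    # Sketch of a workaround: add the custom labels to the NER component up
    # front, so the transition system already knows about B-SERVICE_CONSUMER
    # etc. before training starts.
    import spacy

    nlp = spacy.load("en_core_web_lg")
    ner = nlp.get_pipe("ner")
    for label in ("JURISDICTION", "SERVICE_CONSUMER"):
        ner.add_label(label)
    nlp.to_disk("en_core_web_lg_custom")  # hypothetical output path

You’d then pass en_core_web_lg_custom as the model argument instead of en_core_web_lg. That’s essentially what the add_label loop in the recipe is supposed to do for you automatically.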