KeyError: 'U-quote'

Hello,
I am trying to train a model with a new entity type, starting from a very small number of annotations and a seed list. Can someone give me guidance on what the following error might mean?
See below for the command and stack trace.
Regards
RK

File "ner.pyx", line 178, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: 'U-quote'

This is the call and the stack trace:
Anaconda3\Prodigy>python -m prodigy ner.batch-train annotate_1 en_core_web_lg
--output quotes-model --label quote --eval-split 0.5 --n-iter 6 --batch-size 2

Loaded model en_core_web_lg
Using 50% of examples (26) for evaluation
Using 100% of remaining examples (27) for training
Dropout: 0.2 Batch size: 2 Iterations: 6

BEFORE 0.000
Correct 0
Incorrect 9
Entities 26
Unknown 0

LOSS RIGHT WRONG ENTS SKIP ACCURACY

 7%|▋         | 2/27 [00:00<00:10, 2
'O', 'U-quote', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['U-quote', 'O', 'O', 'O']
Traceback (most recent call last):
  File "C:\Users\rkeyvani\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\rkeyvani\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\rkeyvani\Anaconda3\lib\site-packages\prodigy\__main__.py", line 242, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src\prodigy\core.pyx", line 150, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "C:\Users\rkeyvani\Anaconda3\lib\site-packages\plac_core.py", line 328, in __call__
    cmd, result = parser.consume(arglist)
  File "C:\Users\rkeyvani\Anaconda3\lib\site-packages\plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Users\rkeyvani\Anaconda3\lib\site-packages\prodigy\recipes\ner.py", line 345, in batch_train
    drop=dropout)
  File "cython_src\prodigy\models\ner.pyx", line 302, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src\prodigy\models\ner.pyx", line 361, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src\prodigy\models\ner.pyx", line 356, in prodigy.models.ner.EntityRecognizer._update
  File "C:\Users\rkeyvani\Anaconda3\lib\site-packages\spacy\language.py", line 407, in update
    proc.update(docs, golds, drop=drop, sgd=get_grads, losses=losses)
  File "nn_parser.pyx", line 558, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 676, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 119, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 178, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: 'U-quote'

Are you using the latest version of Prodigy (v1.1.0)? We made a few changes to the new entity type workflow, including letting ner.batch-train add new labels automatically. It definitely looks like the problem here is that the new label quote isn’t added to the entity recognizer or known by the model, so spaCy complains when it’s trying to update.
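If you want to apply that fix by hand in the meantime, this is the gist of it (a minimal sketch, assuming spaCy v2.0.x):

    import spacy

    # Sketch (assuming spaCy v2.0.x): manually register the new entity
    # type with the model's NER component before training on it.
    nlp = spacy.load('en_core_web_lg')
    ner = nlp.get_pipe('ner')
    ner.add_label('quote')  # ner.batch-train should normally do this for you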

I also have this issue and I am definitely on version v1.1.0. I checked my process against the new workflow video for adding a label, and I seem to be following the prescribed process.

I have Prodigy v1.1.0 and spaCy v2.0.5 installed. The error may mean something else; it's hard to tell from the stack trace.

spaCy v2.0.5 and Prodigy v1.1.0

Installed models (spaCy v2.0.5)
lib\site-packages\spacy

TYPE NAME MODEL VERSION
package en-core-web-sm en_core_web_sm 2.0.0
package en-core-web-lg en_core_web_lg 2.0.0
link en_core_web_lg en_core_web_lg 2.0.0

Sorry about this — I'm not 100% sure why the fix we put into v1.1.0 for this isn't working. My best guess is that something in the code assumes that the entity types are upper-cased, since that's how they are in spaCy's training data, and I'm not sure I've tested lower-case types. So this could be the bug, and it could be in either spaCy or Prodigy.

Either way, it should be easy to give you a workaround to keep you productive until we can push the next version in early January.

The error means that the spaCy NER model can't find the entity type you're adding, quote. You should be able to open the file C:\Users\rkeyvani\Anaconda3\lib\site-packages\prodigy\recipes\ner.py and edit the batch_train recipe. Here's how the recipe function should look:

def batch_train(dataset, input_model, output_model=None, label='', factor=1,
                dropout=0.2, n_iter=10, batch_size=32, beam_width=16,
                eval_id=None, eval_split=None, silent=False):
    """
    Batch train a Named Entity Recognition model from annotations. Prodigy will
    export the best result to the output directory, and include a JSONL file of
    the training and evaluation examples. You can either supply a dataset ID
    containing the evaluation data, or choose to split off a percentage of
    examples for evaluation.
    """
    log("RECIPE: Starting recipe ner.batch-train", locals())
    print_ = get_print(silent)
    random.seed(0)
    nlp = spacy.load(input_model)
    print_("\nLoaded model {}".format(input_model))
    if 'sentencizer' not in nlp.pipe_names and 'sbd' not in nlp.pipe_names:
        nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
        log("RECIPE: Added sentence boundary detector to model pipeline",
            nlp.pipe_names)
    examples = merge_spans(DB.get_dataset(dataset))
    random.shuffle(examples)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        ner.cfg['pretrained_dims'] = 300
        for eg in examples:
            for span in eg.get('spans', []):
                ner.add_label(span['label'])
        for l in label:
            ner.add_label(l)
        nlp.add_pipe(ner, last=True)
        nlp.begin_training()
    else:
        ner = nlp.get_pipe('ner')
        for l in label:
            ner.add_label(l)
    if eval_id:
        evals = DB.get_dataset(eval_id)
        print_("Loaded {} evaluation examples from '{}'"
               .format(len(evals), eval_id))
    else:
        examples, evals, eval_split = split_evals(examples, eval_split)
        print_("Using {}% of accept/reject examples ({}) for evaluation"
               .format(round(eval_split * 100), len(evals)))
    model = EntityRecognizer(nlp, label=label)
    log('RECIPE: Initialised EntityRecognizer with model {}'
        .format(input_model), model.nlp.meta)
    examples = list(split_sentences(model.orig_nlp, examples))
    evals = list(split_sentences(model.orig_nlp, evals))
    baseline = model.evaluate(evals)
    log("RECIPE: Calculated baseline from evaluation examples "
        "(accuracy %.2f)" % baseline['acc'])
    best = None
    random.shuffle(examples)
    examples = examples[:int(len(examples) * factor)]
    print_(printers.trainconf(dropout, n_iter, batch_size, factor,
                              len(examples)))
    print_(printers.ner_before(**baseline))
    if len(evals) > 0:
        print_(printers.ner_update_header())

    for i in range(n_iter):
        losses = model.batch_train(examples, batch_size=batch_size,
                                   drop=dropout, beam_width=beam_width)
        stats = model.evaluate(evals)
        if best is None or stats['acc'] > best[0]:
            model_to_bytes = None
            if output_model is not None:
                model_to_bytes = model.to_bytes()
            best = (stats['acc'], stats, model_to_bytes)
        print_(printers.ner_update(i, losses, stats))
    best_acc, best_stats, best_model = best
    print_(printers.ner_result(best_stats, best_acc, baseline['acc']))
    if output_model is not None:
        model.from_bytes(best_model)
        msg = export_model_data(output_model, model.nlp, examples, evals)
        print_(msg)
    best_stats['baseline'] = baseline['acc']
    best_stats['acc'] = best_acc
    return best_stats

This function is what you're executing when you run prodigy ner.batch-train. The label quote should be added to the NER model by these lines:

        ner = nlp.get_pipe('ner')
        for l in label:
            ner.add_label(l)

I would expect the variable label to have the contents ["quote"]. The call to ner.add_label should result in a call to Parser.add_label() here: https://github.com/explosion/spaCy/blob/master/spacy/syntax/nn_parser.pyx#L806. After the call to ner.add_label("quote"), we should see the U-quote label within the parser's transition system. We can check this by printing [ner.moves.get_class_name(i) for i in range(ner.moves.n_moves)].
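To make that check concrete, here's a small snippet you can run in a Python session (a sketch, assuming spaCy v2.0.x and the en_core_web_lg model):

    import spacy

    # Sketch (assuming spaCy v2.0.x): after add_label('quote'), the
    # transition system should include B-quote, I-quote, L-quote and U-quote.
    nlp = spacy.load('en_core_web_lg')
    ner = nlp.get_pipe('ner')
    ner.add_label('quote')
    moves = [ner.moves.get_class_name(i) for i in range(ner.moves.n_moves)]
    print('U-quote' in moves)  # should print True
    print(moves)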

Could you check:

a) That your batch_train function looks as above;
b) That the label variable has the expected contents, ["quote"]
c) That after the call to ner.add_label(), U-quote is a known action within ner.moves (see the snippet above)

If all of a, b and c are true, then the bug is definitely in spaCy.
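Once the file is patched, re-running your original command should get past the KeyError:

    python -m prodigy ner.batch-train annotate_1 en_core_web_lg --output quotes-model --label quote --eval-split 0.5 --n-iter 6 --batch-size 2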

Hello,
Okay, it seems like this is missing from my ner.py. Should I just pop it in, or is there a place where I can grab the correct ner.py in case you refactored something else? Or perhaps I can just drop in your entire batch_train from above? Let me know, thanks.

missing:

        for l in label:
            ner.add_label(l)
        nlp.add_pipe(ner, last=True)
        nlp.begin_training()
    else:
        ner = nlp.get_pipe('ner')
        for l in label:
            ner.add_label(l)

My ner.py was also missing the critical code. I uninstalled and re-installed just in case, but it was the same. If it helps, this is what I have:

def batch_train(dataset, input_model, output_model=None, label='', factor=1,
                dropout=0.2, n_iter=10, batch_size=32, eval_id=None,
                eval_split=None, silent=False):  # pragma: no cover
    """
    Batch train a Named Entity Recognition model from annotations. Prodigy will
    export the best result to the output directory, and include a JSONL file of
    the training and evaluation examples. You can either supply a dataset ID
    containing the evaluation data, or choose to split off a percentage of
    examples for evaluation.
    """
    log("RECIPE: Starting recipe ner.batch-train", locals())
    print_ = get_print(silent)
    random.seed(0)
    nlp = spacy.load(input_model)
    print_("\nLoaded model {}".format(input_model))
    if 'sentencizer' not in nlp.pipe_names and 'sbd' not in nlp.pipe_names:
        nlp.add_pipe(nlp.create_pipe('sentencizer'), first=True)
        log("RECIPE: Added sentence boundary detector to model pipeline",
            nlp.pipe_names)
    examples = merge_spans(DB.get_dataset(dataset))
    random.shuffle(examples)
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        ner.cfg['pretrained_dims'] = 300
        for eg in examples:
            for span in eg.get('spans', []):
                ner.add_label(span['label'])
        nlp.add_pipe(ner, last=True)
        nlp.begin_training()
    if eval_id:
        evals = DB.get_dataset(eval_id)
        print_("Loaded {} evaluation examples from '{}'"
               .format(len(evals), eval_id))
    else:
        examples, evals, eval_split = split_evals(examples, eval_split)
        print_("Using {}% of examples ({}) for evaluation"
               .format(round(eval_split * 100), len(evals)))
    model = EntityRecognizer(nlp, label=label)
    log('RECIPE: Initialised EntityRecognizer with model {}'
        .format(input_model), model.nlp.meta)
    examples = list(split_sentences(model.orig_nlp, examples))
    evals = list(split_sentences(model.orig_nlp, evals))
    baseline = model.evaluate(evals)
    log("RECIPE: Calculated baseline from evaluation examples "
        "(accuracy %.4f)" % baseline['acc'])
    best = None
    random.shuffle(examples)
    examples = examples[:int(len(examples) * factor)]
    print_(printers.trainconf(dropout, n_iter, batch_size, factor,
                              len(examples)))
    print_(printers.ner_before(**baseline))
    if len(evals) > 0:
        print_(printers.ner_update_header())

    for i in range(n_iter):
        losses = model.batch_train(examples, batch_size=batch_size,
                                   drop=dropout)
        stats = model.evaluate(evals)
        if best is None or stats['acc'] > best[0]:
            best = (stats['acc'], stats, model.to_bytes())
        print_(printers.ner_update(i, losses, stats))
    best_acc, best_stats, best_model = best
    print_(printers.ner_result(best_stats, best_acc, baseline['acc']))
    if output_model is not None:
        model.from_bytes(best_model)
        msg = export_model_data(output_model, model.nlp, examples, evals)
        print_(msg)
    best_stats['baseline'] = baseline['acc']
    best_stats['acc'] = best_acc
    return best_stats

It works after I replaced batch_train with the given code and removed beam_width from this line:

    losses = model.batch_train(examples, batch_size=batch_size,
                               drop=dropout, beam_width=beam_width)
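So the working call is just (matching the recipe code above, which doesn't take beam_width):

    losses = model.batch_train(examples, batch_size=batch_size,
                               drop=dropout)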

@rkeyvani @farlee2121 I think I found the problem. You're both on Windows, right? It looks like the Windows wheel didn't build with the correct version – really sorry about that. Our Windows build process is more complex than the macOS and Linux one, so we're currently building those wheels manually. But we're getting better with every version, and I'm starting a new build now so I can re-upload the correct file.

So in the meantime, adding the missing label lines manually or using a Linux or macOS machine should be the best workaround. Will update as soon as the Windows wheel is live! Thanks again for your patience.

Okay, I just rebuilt and re-uploaded the wheel – you should be able to re-download it via your download link. Could you try it and let me know if it all works now? Sorry again!

Seems to be working now
