KeyError: 'token_end' when trying to use ner.batch-train

prodigy ner.batch-train brand_tagging en_core_web_sm --output /models/brand_tag_alpha_2019_06_04 --label BRAND

Using 1 labels: BRAND

Loaded model en_core_web_sm
Using 20% of accept/reject examples (431) for evaluation

Traceback (most recent call last):
  File "/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/lib/python3.7/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/anaconda3/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/anaconda3/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/anaconda3/lib/python3.7/prodigy/recipes/ner.py", line 602, in batch_train
    examples = list(split_sentences(model.orig_nlp, examples))
  File "cython_src/prodigy/components/preprocess.pyx", line 58, in split_sentences
KeyError: 'token_end'
>>> prodigy.__version__
'1.8.2'
>>> spacy.__version__
'2.1.4'

Any thoughts as to what to pursue to fix this?

This is strange :thinking: The quickest way to work around this for now is to disable the sentence segmentation by setting the --unsegmented flag.

What’s in your brand_tagging dataset and where does that data come from? Was it created with a manual recipe, and is it possible that there are differences / mismatches in the tokenization? I just had a look, and the most likely explanation I have is that Prodigy comes across mismatched tokenization between the annotated spans and the split tokens, and doesn’t handle the error nicely.

Thanks for your incredibly quick response!

What’s in your brand_tagging dataset and where does that data come from? Was it created with a manual recipe, and is it possible that there are differences / mismatches in the tokenization?

The data was created with Prodigy like:

prodigy ner.teach brand_tagging en_core_web_sm data_1018.jsonl --label BRAND --patterns brand_patterns.jsonl

…and then a month later I resumed training with a new dataset:

prodigy ner.teach brand_tagging en_core_web_sm data_0219.jsonl --label BRAND --patterns brand_patterns.jsonl

The data_*.jsonl files were created using the exact same code and appear to be identical, but perhaps it has to do with resuming training on a new dataset without completing the task? When Prodigy began showing me lots of very low prediction score examples, I clicked the disk icon in the top left to save and then ended the process in Terminal with Ctrl-C.

Thanks for the quick reply – this all sounds good! Resuming the training shouldn’t be a problem, so I don’t think it’s that. (It might explain the randomness of the suggestions, though – when you restarted training with the base English model, you essentially started again from zero.)

The error looks like Prodigy might have failed to map the entity spans in your annotated data to valid tokens (which is pretty unlikely) and then didn’t handle the error nicely. When you have a second, could you run the following code and see if it causes an error? And if so, share the error message?

import spacy
from prodigy.components.db import connect
from prodigy.components.preprocess import add_tokens

nlp = spacy.load("en_core_web_sm")
db = connect()
examples = db.get_dataset("brand_tagging")
examples_with_tokens = list(add_tokens(nlp, examples))

Running that code snippet now. In the meantime, could you briefly describe how to “resume annotating” with new data, instead of starting over from scratch?

EDIT: Found my answer here: Resume Annotation Session with Prodigy - Text Classification

It appears that the best workflow for what I’m trying to do (adding new annotations based on a previously annotated dataset + model) is to batch-train the model after the first annotation session, export it, and then use that exported model, rather than the base English model, as the base when annotating a new dataset.

When you have a second, could you run the following code and see if it causes an error? And if so, share the error message?

>>> examples_with_tokens = list(add_tokens(nlp, examples))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cython_src/prodigy/components/preprocess.pyx", line 130, in add_tokens
  File "cython_src/prodigy/components/preprocess.pyx", line 159, in prodigy.components.preprocess._add_tokens
ValueError: Mismatched tokenization. Can't resolve span to token index 82. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task. 

{'start': 82, 'end': 84, 'text': '27', 'rank': 3, 'label': 'BRAND', 'score': 0.06382654830000001, 'source': 'en_core_web_sm', 'input_hash': -1505595276}
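For what it’s worth, the alignment requirement behind that error can be illustrated without spaCy at all. The sketch below is my own (the helper name and sample offsets are illustrative, not a Prodigy API): a span’s character boundaries have to coincide exactly with token boundaries, and a span like the '27' above fails when the tokenizer merged those characters into a larger token.

```python
# Hypothetical helper: check whether an annotated span lines up with token
# boundaries, mirroring the alignment check that add_tokens performs.
# token_offsets stands in for whatever (start, end) pairs a tokenizer produced.

def span_is_aligned(span, token_offsets):
    """Return True if span["start"] begins some token and span["end"] ends one."""
    starts = {start for start, end in token_offsets}
    ends = {end for start, end in token_offsets}
    return span["start"] in starts and span["end"] in ends

# Tokens for "call me at 2pm": "2pm" is a single token at characters 11-14,
# so a span covering only "2" (11-12) cannot be resolved to a token index.
offsets = [(0, 4), (5, 7), (8, 10), (11, 14)]
assert span_is_aligned({"start": 11, "end": 14, "label": "TIME"}, offsets)
assert not span_is_aligned({"start": 11, "end": 12, "label": "TIME"}, offsets)
```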

This worked in that it ran, although it appears something else is corrupted somewhere…

BEFORE      0.000             
Correct     0   
Incorrect   316
Entities    808               
Unknown     0                 

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
01           23558.318    86           230          804          0            0.272                  
02           20139.017    76           240          389          0            0.241                  
03           18896.438    80           236          280          0            0.253                  
04           18204.819    83           233          253          0            0.263                  
05           17361.884    93           223          217          0            0.294                    

Prior to upgrading, I was getting 80+% accuracy on my ~5k annotations; now it appears there are no correct answers in the database :thinking:

Not sure what I could’ve mucked up here :cry: … I tried exporting the dataset to jsonl and then re-loading it as a new dataset, but that also didn’t work. I ensured that there are in fact a number of accept examples in here and that the annotations look basically right.

Did you train with the exact same data before?

The Correct: 0 refers to the evaluation before training, so that’s expected. You’re training a new label BRAND and you’re starting off with a pre-trained base model that has never seen that label before. So before training, it will naturally not predict a single one of your examples correctly. After 5 iterations, it gets 93 of your examples correct.
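For reference, the ACCURACY column is just the proportion of correct predictions on the held-out evaluation examples, RIGHT / (RIGHT + WRONG). A quick sanity check against the numbers in the table (nothing Prodigy-specific, just the arithmetic):

```python
# Accuracy per iteration: correct predictions over evaluated predictions,
# rounded to three decimals as shown in the training table.
def accuracy(right, wrong):
    return round(right / (right + wrong), 3)

assert accuracy(93, 223) == 0.294   # iteration 05
assert accuracy(86, 230) == 0.272   # iteration 01
assert accuracy(0, 316) == 0.0      # the BEFORE row: nothing correct yet
```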

Okay, it looks like you might have hit one of the very rare cases where the tokenization in the previous spaCy model differs from the tokenization now, and where you’ve actually annotated a span that doesn’t map to valid tokens anymore. In your case, though, it shouldn’t matter very much.

Edit: Just released v1.8.3, which fixes the underlying issue in the sentence segmentation preprocessor that caused this not-so-nice error. The mismatched spans should now be excluded by default.


Fantastic! Downloading now.

EDIT: New error. I get this one whether I run with --unsegmented or not:

prodigy ner.batch-train brand_tagging models/ner_model_alpha_2019_06_07 --output /models/ner_model_alpha_2019_06_07_v2 --label BRAND
Using 1 labels: BRAND

Loaded model models/ner_model_alpha_2019_06_07
Using 20% of accept/reject examples (663) for evaluation
Using 100% of remaining examples (2864) for training
Dropout: 0.2  Batch size: 16  Iterations: 10  


BEFORE      0.909            
Correct     451
Incorrect   45
Entities    688              
Unknown     48               

#            LOSS         RIGHT        WRONG        ENTS         SKIP         ACCURACY  
Traceback (most recent call last):                                                                   
  File "/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/anaconda3/lib/python3.7/site-packages/prodigy/__main__.py", line 380, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 212, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/anaconda3/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/anaconda3/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/anaconda3/lib/python3.7/site-packages/prodigy/recipes/ner.py", line 621, in batch_train
    examples, batch_size=batch_size, drop=dropout, beam_width=beam_width
  File "cython_src/prodigy/models/ner.pyx", line 362, in prodigy.models.ner.EntityRecognizer.batch_train
  File "cython_src/prodigy/models/ner.pyx", line 425, in prodigy.models.ner.EntityRecognizer._update
  File "cython_src/prodigy/models/ner.pyx", line 388, in prodigy.models.ner.EntityRecognizer.predict_best
  File "cython_src/prodigy/models/ner.pyx", line 62, in prodigy.models.ner._BatchBeam.__init__
  File "nn_parser.pyx", line 302, in spacy.syntax.nn_parser.Parser.beam_parse
  File "nn_parser.pyx", line 386, in spacy.syntax.nn_parser.Parser.transition_beams
  File "search.pyx", line 149, in thinc.extra.search.Beam.advance
AssertionError

I believe this might have to do with trying to run ner.batch-train on a model that has had an EntityRuler added to the pipeline. An identical model without the EntityRuler trains without problems. This could be a mistake on my part at this point; I wonder if perhaps I need to add a Factory.

Ah, interesting! This is likely a problem in spaCy – the model seems to get confused by the pre-existing entities set in the pipeline. We need to think about how to best handle this – in the meantime, leaving the entity ruler out of the pipeline during training is probably the best solution. You can always write a little script that adds it afterwards and re-saves the model.
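A minimal sketch of such a script, assuming spaCy v2.x and patterns stored in a JSONL file like the brand_patterns.jsonl used earlier in the thread (the function name and paths are placeholders, not an official recipe):

```python
import spacy
from spacy.pipeline import EntityRuler

def add_entity_ruler(model_path, patterns_path, output_path):
    """Load a trained model, add an EntityRuler built from a JSONL patterns
    file, and save the combined pipeline back to disk (spaCy v2.x API)."""
    nlp = spacy.load(model_path)
    ruler = EntityRuler(nlp).from_disk(patterns_path)
    nlp.add_pipe(ruler, before="ner")  # run the ruler before the statistical NER
    nlp.to_disk(output_path)

# Usage (paths are placeholders):
# add_entity_ruler("/models/ner_model_alpha_2019_06_07_v2",
#                  "brand_patterns.jsonl",
#                  "/models/ner_model_with_ruler")
```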
