ner correct with prodigy 1.11.8

hi @JulieSarah!

Mismatched tokenization can be a big problem, and many users don't realize how important it is until it bites them. Typically it happens when users load pre-annotated spans/entities into manual recipes: they annotated the spans/entities in another tool, formatted the data for Prodigy, and then used it in a manual recipe (e.g., ner.manual) with a different tokenizer (e.g., blank:en or en_core_web_sm).
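Here's a quick way to see the problem (the text and offsets below are made up, just for illustration). When character offsets fall inside a token, doc.char_span returns None:

import spacy

nlp = spacy.blank("en")
doc = nlp("Patient has COVID19-positive status")
print([t.text for t in doc])

# Suppose another tool annotated just "COVID19" (characters 12-19) as an entity.
# If this tokenizer keeps "COVID19-positive" as a single token, those offsets
# don't line up with token boundaries, so char_span returns None:
print(doc.char_span(12, 19))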

This is a good post that highlights the issue and provides some context on how to identify it.

When you say you created a new tokenizer -- was it built from scratch, something like this?
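(By "from scratch" I mean something along the lines of spaCy's custom tokenizer docs, where you replace nlp.tokenizer with your own callable -- this is just a sketch, since I'm guessing at what you did:)

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    # a deliberately simple custom tokenizer: split on single spaces only
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)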

Did you train a ner component in a spaCy pipeline that included a custom tokenizer different from the tokenizer used to create the ner annotations?

Yeah - that could cause problems. Is there a compelling reason why you didn't make your annotations with the same custom tokenizer that you intend to include in the pipeline?

I suspect you made the annotations first. Then, after reviewing them, you found some problems with spaCy's default tokenizer, so you decided to build a custom tokenizer but didn't want to redo all of the annotations.

I would recommend using the code from the earlier post to determine which annotations have a mismatch.

import spacy
import srsly

nlp = spacy.load("my_custom_model")  # model with your custom_tokenizer

# load your existing annotations, e.g. from a JSONL export (filename is hypothetical)
examples = srsly.read_jsonl("annotations.jsonl")

for example in examples:
    doc = nlp(example["text"])
    for span in example.get("spans", []):
        # char_span returns None when start/end don't map to token boundaries
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:
            print("Misaligned tokens", example["text"], span)

Typically, only a small fraction of examples will have a mismatch. You could re-annotate just those examples, this time with your custom tokenizer.
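For example, you could write only the mismatched examples out to a file and re-annotate those (a sketch; the dataset and file names are placeholders):

import spacy
import srsly

nlp = spacy.load("my_custom_model")
examples = srsly.read_jsonl("annotations.jsonl")

def is_misaligned(eg):
    doc = nlp(eg["text"])
    return any(
        doc.char_span(s["start"], s["end"]) is None
        for s in eg.get("spans", [])
    )

srsly.write_jsonl("misaligned.jsonl", [eg for eg in examples if is_misaligned(eg)])

Then re-annotate them with your custom tokenizer in the loop, e.g.:

prodigy ner.manual fixed_annotations my_custom_model misaligned.jsonl --label YOUR_LABELS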

Alternatively, you could try this package to "align" your annotations' tokenization to your new custom tokenizer:

I haven't used this package myself, so I can't offer much more guidance (but hopefully it's self-explanatory).
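One more built-in option worth mentioning: in spaCy v3, doc.char_span accepts an alignment_mode argument, so instead of dropping misaligned spans you can snap them to token boundaries. This does change the span slightly, so spot-check the results:

# "expand" snaps the offsets outward to the nearest token boundaries;
# "contract" snaps them inward instead (default is "strict", which returns None)
char_span = doc.char_span(span["start"], span["end"], alignment_mode="expand")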

Hope this helps!