hi @JulieSarah!
So mismatched tokenization can be a big problem, and many users don't realize how important it is until it happens. Typically it comes up when users have pre-annotated spans/entities that they load into manual recipes: spans/entities annotated in another tool, formatted for Prodigy, and then used in a manual recipe (e.g., ner.manual) with a different tokenizer (e.g., blank:en or en_core_web_sm).
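To make the mismatch concrete, here's a small made-up sketch (hypothetical text and offsets) of a pre-annotated record whose character offsets don't line up with the tokens that blank:en produces, so the span can't be mapped onto whole tokens:

import spacy

nlp = spacy.blank("en")

# Hypothetical pre-annotated record formatted for Prodigy.
# The offsets are off by one character (e.g., they were computed on
# slightly different text), so the span starts mid-token inside "Acme".
example = {
    "text": "Acme Corp announced earnings",
    "spans": [{"start": 1, "end": 10, "label": "ORG"}],
}

doc = nlp(example["text"])
print([t.text for t in doc])  # ['Acme', 'Corp', 'announced', 'earnings']

span = example["spans"][0]
print(doc.char_span(span["start"], span["end"]))  # None -> misaligned span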
This is a good post that highlights the issue and provides some context on how to identify it.
When you say you created a new tokenizer -- was it from scratch, like this:
Did you train a ner component in a spaCy pipeline that included a custom tokenizer different from the tokenizer used to create the ner annotations you trained on?
Yeah - that could cause problems. Is there a compelling reason why you didn't make your annotations with the same custom tokenizer that you intend to include in the pipeline?
I suspect you made the annotations first. Then, after reviewing them, you found some problems with spaCy's default tokenizer, so you decided to build a custom tokenizer but didn't want to redo all of the annotations.
I would recommend using the code from the earlier post to determine which annotations have a mismatch.
import spacy

nlp = spacy.load("my_custom_model")  # model with your custom_tokenizer

for example in examples:  # your existing annotations
    doc = nlp(example["text"])
    for span in example["spans"]:
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to token boundaries
            print("Misaligned tokens", example["text"], span)
Typically, only a small number of examples are misaligned. You could re-annotate just those examples, this time with your custom tokenizer.
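For example, building on the snippet above, you could split your existing annotations into aligned and misaligned sets and write the misaligned ones to a JSONL file to load back into ner.manual (the file names here are just placeholders):

import spacy
import srsly  # ships with spaCy/Prodigy, handy for reading/writing JSONL

nlp = spacy.load("my_custom_model")  # model with your custom_tokenizer
examples = list(srsly.read_jsonl("existing_annotations.jsonl"))  # hypothetical path

aligned, misaligned = [], []
for example in examples:
    doc = nlp(example["text"])
    spans_ok = all(
        doc.char_span(span["start"], span["end"]) is not None
        for span in example.get("spans", [])
    )
    (aligned if spans_ok else misaligned).append(example)

# Keep the aligned examples as-is and re-annotate only the misaligned ones, e.g.:
# prodigy ner.manual my_dataset my_custom_model misaligned.jsonl --label YOUR_LABELS
srsly.write_jsonl("misaligned.jsonl", misaligned)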
Alternatively, you could try this package to "align" your annotations' tokenization to your new custom tokenizer:
I haven't used this package so I can't provide a lot more suggestions (but hopefully the package is self-explanatory).
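Another option that doesn't require an extra package: if you're on spaCy v3, doc.char_span accepts an alignment_mode argument ("expand" or "contract") that snaps slightly-off spans to token boundaries of your custom tokenizer. A rough sketch, assuming the same example format as above:

import spacy

nlp = spacy.load("my_custom_model")  # model with your custom_tokenizer

def snap_spans_to_tokens(example, nlp):
    # Expand each character span outward to the nearest token boundaries
    # of the custom tokenizer (requires spaCy v3's alignment_mode).
    doc = nlp(example["text"])
    new_spans = []
    for span in example.get("spans", []):
        char_span = doc.char_span(span["start"], span["end"], alignment_mode="expand")
        if char_span is not None:
            new_spans.append({
                "start": char_span.start_char,
                "end": char_span.end_char,
                "label": span["label"],
            })
    example["spans"] = new_spans
    return example

Note that expanding changes which characters a span covers, so you'd still want to spot-check those examples (e.g., in ner.manual) before training.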
Hope this helps!