Matching tokenisation on pre-existing annotated data

Hi! The best approach depends on what the underlying mismatches are. To get a better overview, you could write a small script that processes your texts with spaCy's default tokenizer and then calls doc.char_span with each annotation's start and end character offsets. If that returns None, the span doesn't map onto token boundaries and you have a mismatch.
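For example, a minimal version of that script could look like this (the example text and offsets are made up for illustration — a span that ends in the middle of a token):

```python
import spacy

nlp = spacy.blank("en")  # or load the pipeline you're using

# Hypothetical annotated example: (text, [(start, end, label), ...])
text = "Check the tokenization first"
spans = [(10, 15, "TERM")]  # "token" – ends mid-token, so it won't align

doc = nlp.make_doc(text)  # tokenize only, no other components
for start, end, label in spans:
    span = doc.char_span(start, end, label=label)
    if span is None:
        print(f"Mismatch: {text[start:end]!r} doesn't align with token boundaries")
```

Running this over your whole corpus and counting the mismatches should tell you quickly whether you're dealing with a handful of edge cases or a systematic difference.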

Maybe you'll find that it mostly comes down to things like punctuation (e.g. hyphens) that can be resolved by slightly tweaking spaCy's tokenization to better match your data. Or maybe it turns out that some of your existing annotated entities have off-by-one errors or include leading or trailing whitespace. That's easy to fix programmatically: just move the start/end offset inward by 1 if the text span starts or ends with whitespace.
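The whitespace fix could be a small helper like this (just a sketch — the function name and the example are made up):

```python
def strip_whitespace_offsets(text, start, end):
    """Shift span offsets inward past leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# " Apple Inc." annotated with extra whitespace on both sides
text = "We met with Apple Inc. today"
start, end = strip_whitespace_offsets(text, 11, 23)  # " Apple Inc. "
print(text[start:end])  # → "Apple Inc."
```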

If you find cases with actual mistakes that need to be re-annotated, you could also queue them up again for annotation in Prodigy without the mismatched "spans". You could even add the info about the original entity to the task's "meta", so it's displayed in the corner of the annotation card and you can re-annotate it using the existing tokenization.
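Sketching that out, assuming you've collected the mismatched examples as (text, start, end, label) tuples, the re-queued tasks could look like this — the "meta" keys here are arbitrary names, not anything Prodigy requires:

```python
import json

# Hypothetical mismatched example from the existing annotations
text = "Check the tokenization first"
start, end, label = 10, 15, "TERM"

# Re-queue without the mismatched "spans", but keep the original
# entity in "meta" so it's displayed in the corner of the card
task = {
    "text": text,
    "meta": {
        "orig_entity": text[start:end],
        "orig_label": label,
        "orig_offsets": [start, end],
    },
}
print(json.dumps(task))  # one line of a JSONL file to load into Prodigy
```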