Matching tokenisation on pre-existing annotated data

We have pre-existing data annotated for NER that I'd like to use Prodigy to review and correct before training an NER model on it. It's in the (text, {entities: ...}) spaCy training format but not tokenised.

Having converted this to {text: ..., spans: ...} as input to ner.manual, a few examples load fine, but some of the existing spans don't line up exactly with the tokens spaCy generates and I see a 'ValueError: Mismatched tokenization'.

We'd like to 'snap' the existing spans to agree with spaCy's tokenisation, so my question is: what's the best way to go about this?

thanks in advance!

Hi! The best approach depends on what the underlying mismatches are. If you need a better overview, you could write a small script that processes your texts with spaCy's default tokenizer and then calls doc.char_span with the start and end offsets. If that returns None, you have a mismatch.
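Here's a minimal sketch of such a check, assuming your data is a list of (text, {"entities": ...}) tuples and that the blank English tokenizer is close enough to what you'll train with:

```python
import spacy

nlp = spacy.blank("en")  # or load the pipeline whose tokenizer you'll actually use

def find_mismatches(examples):
    """Return entities whose character offsets don't map onto token boundaries."""
    mismatches = []
    for text, annots in examples:
        doc = nlp.make_doc(text)
        for start, end, label in annots["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is None:
                # offsets fall inside a token, so Prodigy can't render this span
                mismatches.append((text[start:end], start, end, label))
    return mismatches
```

Inspecting the returned snippets usually makes it obvious whether you're dealing with tokenizer quirks or annotation errors.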

Maybe you'll find that it mostly comes down to things like punctuation (e.g. hyphens) that can be adjusted by slightly tweaking spaCy's tokenization to better match your data. Or maybe it turns out that some of your existing annotated entities have off-by-one errors or include leading or trailing whitespace. That can be easily fixed programmatically by adjusting the start / end offset by 1 if the text span starts or ends with whitespace.
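For the whitespace case, a small helper like this (hypothetical, not part of spaCy or Prodigy) would shift the offsets inward until they no longer touch whitespace:

```python
def strip_whitespace_offsets(text, start, end):
    """Move start/end inward while the span begins or ends with whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end
```

You'd run this over your existing entities before converting them to "spans", and only re-check the ones that still fail doc.char_span afterwards.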

If you find cases with actual mistakes that need to be re-annotated, you could also queue them up again for annotation in Prodigy without the mismatched "spans" (and maybe add the info about the original entity to the task's "meta" so it's displayed in the corner and you can try to reannotate it using the existing tokenization).
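For example, assuming a list of mismatched entities like the one produced above, you could write them out as a JSONL file for ner.manual, keeping the original annotation in "meta":

```python
import json

def make_requeue_task(text, start, end, label):
    # drop the mismatched span, but keep the original entity in "meta"
    # so it's displayed in the corner of the annotation card
    return {
        "text": text,
        "meta": {"original": f"{text[start:end]} ({label}, {start}-{end})"},
    }

with open("requeue.jsonl", "w", encoding="utf8") as f:
    for text, start, end, label in mismatched:  # "mismatched" is your own list
        f.write(json.dumps(make_requeue_task(text, start, end, label)) + "\n")
```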

Thanks for the swift reply. The pointer to char_span was very helpful. Most were slight whitespace disagreements and were readily fixed.
