Matching tokenisation on pre-existing annotated data

Hi! The best approach depends on what the underlying mismatches are. To get a better overview, you could write a small script that processes your texts with spaCy's default tokenizer and then calls doc.char_span with each annotation's start and end character offsets. If that returns None, the span doesn't map onto token boundaries and you have a mismatch.
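For example, a minimal version of that script could look like this (the example text and offsets are made up for illustration — a span that ends in the middle of a token):

```python
import spacy

nlp = spacy.blank("en")  # or load the pipeline you're using

# Hypothetical annotated example: (text, [(start, end, label), ...])
text = "Check the tokenization first"
spans = [(10, 15, "TERM")]  # "token" – ends mid-token, so it won't align

doc = nlp.make_doc(text)  # tokenize only, no other components
for start, end, label in spans:
    span = doc.char_span(start, end, label=label)
    if span is None:
        print(f"Mismatch: {text[start:end]!r} doesn't align with token boundaries")
```

Running this over your whole corpus and counting the mismatches should tell you quickly whether you're dealing with a handful of edge cases or a systematic difference.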

Maybe you'll find that it mostly comes down to things like punctuation (e.g. hyphens) that can be resolved by slightly tweaking spaCy's tokenization to better match your data. Or maybe it turns out that some of your existing annotated entities have off-by-one errors or include leading or trailing whitespace. That's easy to fix programmatically: just move the start/end offset inward by 1 if the text span starts or ends with whitespace.
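The whitespace fix could be a small helper like this (just a sketch — the function name and the example are made up):

```python
def strip_whitespace_offsets(text, start, end):
    """Shift span offsets inward past leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# " Apple Inc." annotated with extra whitespace on both sides
text = "We met with Apple Inc. today"
start, end = strip_whitespace_offsets(text, 11, 23)  # " Apple Inc. "
print(text[start:end])  # → "Apple Inc."
```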

If you find cases with actual mistakes that need to be re-annotated, you could also queue them up again for annotation in Prodigy without the mismatched "spans". You could even add the info about the original entity to the task's "meta", so it's displayed in the corner of the annotation card and you can re-annotate it using the existing tokenization.
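Sketching that out, assuming you've collected the mismatched examples as (text, start, end, label) tuples, the re-queued tasks could look like this — the "meta" keys here are arbitrary names, not anything Prodigy requires:

```python
import json

# Hypothetical mismatched example from the existing annotations
text = "Check the tokenization first"
start, end, label = 10, 15, "TERM"

# Re-queue without the mismatched "spans", but keep the original
# entity in "meta" so it's displayed in the corner of the card
task = {
    "text": text,
    "meta": {
        "orig_entity": text[start:end],
        "orig_label": label,
        "orig_offsets": [start, end],
    },
}
print(json.dumps(task))  # one line of a JSONL file to load into Prodigy
```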