I am loading pre-annotated data into Prodigy's ner.train, as I want to reduce the annotation workload. Unfortunately, the software used to annotate the named entities produces spans that do not align with the tokens of the spaCy tokenizer I am using.
I get the following error, and Prodigy does not load any more data into the UI. Is there any way to handle this case instead of failing, so that examples with misaligned entities are ignored by Prodigy in the annotation step?
ValueError: Mismatched tokenization. Can't resolve span to token index 594. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.
Hi! The easiest solution would be to adjust the recipe script you're using and to set skip=True on the add_tokens preprocessor. This will not raise the error and instead just skip the span that can't be mapped to tokens.
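As a rough sketch of what that change could look like in a custom recipe: the helper name tokenized_stream is hypothetical, and how the stream and nlp object are created depends on the recipe you are editing. The Prodigy import is placed inside the function only so the snippet can be read and loaded without Prodigy installed.

```python
def tokenized_stream(stream, nlp):
    """Wrap a stream of annotation tasks so spans are mapped onto spaCy tokens.

    Hypothetical helper: in a real recipe, `stream` is whatever loader the
    recipe already uses and `nlp` is the pipeline whose tokenizer you rely on.
    """
    # Imported lazily here; this line requires a Prodigy installation.
    from prodigy.components.preprocess import add_tokens

    # skip=True skips what cannot be aligned instead of raising
    # "ValueError: Mismatched tokenization".
    return add_tokens(nlp, stream, skip=True)
```

In the recipe, you would then pass the result of this call on as the stream in place of the existing add_tokens call.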
You can also easily check which spans don't align by using spaCy's Doc.char_span method, which tries to generate a character-based span and returns None if it's not possible. This can sometimes be helpful because maybe there's a common pattern/problem that's easy to adjust programmatically (whitespace, offsets etc.). Here's a simple example:
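import spacy

nlp = spacy.blank("en")  # whichever model/tokenizer you want to use
for eg in your_examples_here:  # placeholder: your list of pre-annotated examples
    doc = nlp(eg["text"])
    for span in eg.get("spans", []):
        # char_span returns None if the offsets don't fall on token boundaries
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:
            print("Misaligned span", span)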
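If you would rather filter the data before it ever reaches Prodigy, the same check can be used to separate clean examples from problematic ones. This is only a minimal sketch: the function name split_by_alignment and the two sample examples are made up for illustration, and it assumes each example is a dict with a "text" key and an optional "spans" list of character offsets.

```python
import spacy


def split_by_alignment(examples, nlp):
    """Return (aligned, misaligned) lists of examples.

    An example counts as aligned only if every one of its spans maps
    onto token boundaries of the given tokenizer.
    """
    aligned, misaligned = [], []
    for eg in examples:
        doc = nlp(eg["text"])
        spans = eg.get("spans", [])
        # Doc.char_span returns None when the offsets cut through a token
        ok = all(doc.char_span(s["start"], s["end"]) is not None for s in spans)
        (aligned if ok else misaligned).append(eg)
    return aligned, misaligned


nlp = spacy.blank("en")  # use the same tokenizer as in your recipe
examples = [
    # "Apple" is a full token, so this span aligns
    {"text": "Apple is hiring in London", "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    # "App" ends in the middle of a token, so this span does not align
    {"text": "Apple is hiring in London", "spans": [{"start": 0, "end": 3, "label": "ORG"}]},
]
aligned, misaligned = split_by_alignment(examples, nlp)
print(len(aligned), "aligned,", len(misaligned), "misaligned")
```

You could then write only the aligned examples to the file you load into Prodigy, and inspect the misaligned ones separately to see whether the offsets can be repaired.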