Insert Exception to skip cases where tokens are misaligned.

JBunr · October 9, 2020, 8:51am

I am loading pre-annotated data into Prodigy's ner.train, as I want to reduce the annotation workload. Unfortunately, the software used to annotate the named entities produces tokens that do not align with the spaCy tokenizer I am using.

I get the following error, and Prodigy does not load any more data into the UI. Is there any way to handle this exception instead, so that cases with misaligned entities are simply ignored by Prodigy in the annotation step?

ValueError: Mismatched tokenization. Can't resolve span to token index 594. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task.

ines · October 12, 2020, 8:07am

Hi! The easiest solution would be to adjust the recipe script you're using and to set skip=True on the add_tokens preprocessor. This will not raise the error and instead just skip the span that can't be mapped to tokens.

You can also easily check which spans don't align by using spaCy's Doc.char_span method, which tries to generate a character-based span and returns None if it's not possible. This can sometimes be helpful because maybe there's a common pattern/problem that's easy to adjust programmatically (whitespace, offsets etc.). Here's a simple example:

import spacy

nlp = spacy.blank("en")  # whichever model/tokenizer you want to use
for eg in your_examples_here:  # placeholder: your pre-annotated examples
    doc = nlp(eg["text"])
    for span in eg.get("spans", []):
        # char_span returns None if the offsets don't fall on token boundaries
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:
            print("Misaligned span", span)
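If your tasks already carry a 'tokens' property (as the error message suggests), a similar check can be done on the raw dicts without running a tokenizer. The following is only a rough sketch under that assumption: it expects each token to have character offsets under "start" and "end", and the helper names split_aligned_spans and drop_misaligned are made up for illustration, not part of Prodigy or spaCy. It keeps spans whose offsets fall on token boundaries and drops the rest.

```python
def split_aligned_spans(task):
    """Return (aligned, misaligned) spans for one task dict.

    Assumption: task["tokens"] is a list of dicts with character offsets
    under "start" and "end", and task["spans"] uses the same offsets.
    """
    tokens = task.get("tokens", [])
    token_starts = {tok["start"] for tok in tokens}
    token_ends = {tok["end"] for tok in tokens}
    aligned, misaligned = [], []
    for span in task.get("spans", []):
        # A span is aligned if it starts where a token starts
        # and ends where a token ends.
        if span["start"] in token_starts and span["end"] in token_ends:
            aligned.append(span)
        else:
            misaligned.append(span)
    return aligned, misaligned


def drop_misaligned(tasks):
    """Yield copies of the tasks that keep only the aligned spans."""
    for task in tasks:
        aligned, misaligned = split_aligned_spans(task)
        for span in misaligned:
            print("Dropping misaligned span", span)
        yield {**task, "spans": aligned}


# Made-up example task for illustration
example = {
    "text": "Apple Inc. is big",
    "tokens": [
        {"text": "Apple", "start": 0, "end": 5, "id": 0},
        {"text": "Inc.", "start": 6, "end": 10, "id": 1},
        {"text": "is", "start": 11, "end": 13, "id": 2},
        {"text": "big", "start": 14, "end": 17, "id": 3},
    ],
    "spans": [
        {"start": 0, "end": 10, "label": "ORG"},  # covers "Apple Inc."
        {"start": 0, "end": 9, "label": "ORG"},   # ends inside "Inc."
    ],
}
cleaned = list(drop_misaligned([example]))
```

Printing the dropped spans rather than discarding them silently makes it easier to spot a recurring pattern (trailing whitespace, off-by-one offsets) that could be fixed programmatically instead.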
