NER training on dataset which was annotated on older version.

ines · January 26, 2021, 11:18pm

Hi! It looks like your data contains misaligned tokens, i.e. spans whose character offsets don't map onto token boundaries. Old versions of spaCy quietly skipped these, but newer versions raise an explicit error. Did you use the same tokenizer during annotation and training?

One easy way to find the misaligned examples, check what's wrong and/or exclude them from your dataset is to load your Prodigy dataset and use spaCy's Doc.char_span method to check that all spans refer to valid tokens. Doc.char_span returns None if the character offsets don't line up with token boundaries. If there are only a few problematic examples, you could just skip them and save the filtered examples to a new dataset.

import spacy
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("your_dataset_here")  # Prodigy dataset
nlp = spacy.blank("en")  # whichever language/model you used
for example in examples:
    doc = nlp(example["text"])
    for span in example.get("spans", []):  # some examples may have no spans
        char_span = doc.char_span(span["start"], span["end"])
        if char_span is None:  # start and end don't map to tokens
            print("Misaligned tokens", example["text"], span)
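
If you'd rather skip the problematic examples and save the rest, something like the following might work as a starting point. This is only a rough sketch: the helper names split_by_alignment and save_filtered and the target dataset name are made up for illustration, and it assumes the database's add_dataset and add_examples methods are available in your Prodigy version.

```python
def split_by_alignment(examples, nlp):
    """Split examples into (aligned, misaligned) using Doc.char_span.

    Hypothetical helper: `nlp` is whatever pipeline/tokenizer you train with.
    """
    aligned, misaligned = [], []
    for eg in examples:
        doc = nlp(eg["text"])
        spans = eg.get("spans", [])  # examples without spans count as aligned
        # char_span returns None if the offsets don't map to token boundaries
        ok = all(doc.char_span(s["start"], s["end"]) is not None for s in spans)
        (aligned if ok else misaligned).append(eg)
    return aligned, misaligned


def save_filtered(source, target, nlp):
    """Copy only the aligned examples from `source` into a new dataset `target`.

    Assumes Prodigy is installed; imported here so the helper above
    can be used on its own.
    """
    from prodigy.components.db import connect

    db = connect()
    examples = db.get_dataset(source) or []
    aligned, misaligned = split_by_alignment(examples, nlp)
    db.add_dataset(target)  # create the new dataset
    db.add_examples(aligned, datasets=[target])
    print(f"Kept {len(aligned)} examples, skipped {len(misaligned)}")
    return misaligned
```

You could then call save_filtered("your_dataset_here", "your_dataset_filtered", spacy.blank("en")) and train from the filtered dataset. The returned misaligned examples are worth a quick look, since they usually show where the two tokenizers disagree.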
