Hi! Are you able to find the particular example it fails on in your data? The basic problem here is this: the character offsets in your annotations refer to spans of text that don't map to the token boundaries produced by the tokenizer. This means that an NER model, which predicts token-based tags, can't be updated with this information or learn anything from it, because it will never actually get to see those tokens at runtime.
For instance, imagine you have a text like "AB C", which the tokenizer will split into ["AB", "C"]. If your data annotates a span describing the character offsets for "A", this will be a problem, because there isn't a token for "A" that can be labelled.
So it comes down to finding those cases in your data so you can see what the main problems are. I've posted a script on this thread that you can use to find mismatched character offsets programmatically:
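In case it's helpful, here's a minimal sketch of the idea (not necessarily identical to that script): `Doc.char_span` returns `None` if the character offsets don't line up with token boundaries, so you can loop over your examples and report the spans that come back as `None`. The example data below is made up.

```python
import spacy

nlp = spacy.blank("en")  # or the pipeline whose tokenizer you're training with

# Made-up examples in the usual (text, {"entities": [(start, end, label)]}) format
TRAIN_DATA = [
    ("AB C", {"entities": [(0, 1, "LABEL")]}),  # span "A" doesn't match a token
    ("AB C", {"entities": [(0, 2, "LABEL")]}),  # span "AB" aligns fine
]

for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    for start, end, label in annotations["entities"]:
        # char_span returns None if (start, end) doesn't map onto token boundaries
        if doc.char_span(start, end, label=label) is None:
            print(f"Misaligned span {text[start:end]!r} ({start}, {end}) in {text!r}")
            print("Tokens:", [token.text for token in doc])
```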
Sometimes, the problem can be caused by basic preprocessing issues such as leading/trailing whitespace, or even off-by-one errors in the character offsets (if you've generated them programmatically). Those can usually be fixed pretty easily in a preprocessing step.
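For example, trimming whitespace off the annotated spans could be as simple as something like this (just a sketch, assuming your annotations store (start, end) character offsets):

```python
def strip_span_whitespace(text, start, end):
    """Move span boundaries inwards past leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# A span annotated as " C" in "AB C" becomes "C"
print(strip_span_whitespace("AB C", 2, 4))  # (3, 4)
```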
In other cases, it can point to tokenization settings that you might want to adjust to better fit the data you're working with. For instance, the tokenizer preserves a string like "hello.world" as a single token ["hello.world"], but you might want to split it into 3 tokens ["hello", ".", "world"]. If there are common patterns like this in your data, you can adjust the tokenization rules to be stricter and split more by modifying the punctuation rules (see the sketch below): Linguistic Features · spaCy Usage Documentation
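Here's a rough sketch of what that could look like for the "hello.world" case, adding an extra infix rule so a "." between two letters becomes its own token. The exact regex is just an illustration, so tailor it to the patterns in your data.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# By default, "hello.world" comes out as a single token: ['hello.world']
# Add an extra infix pattern that splits on "." between two letters.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])\.(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
# Depending on your pipeline, strings like this can also be caught by the
# tokenizer's URL matching, so it's switched off here for the sketch.
nlp.tokenizer.url_match = None

print([token.text for token in nlp("hello.world")])  # ['hello', '.', 'world']
```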
In general, it's good to tackle this early on and make sure you have a tokenizer in place that produces tokens that match the spans you're looking to annotate and predict – otherwise, you can easily end up with data that your model can't learn from.