Hi! Are you able to find the particular example it fails on in your data? The basic problem here is this: the character offsets in your annotations refer to spans of text that don't map to the token boundaries produced by the tokenizer. This means that an NER model, which predicts token-based tags, can't be updated with this information or learn anything from it, because it will never actually get to see those tokens at runtime.
For instance, imagine you have a text like "AB C", which the tokenizer will split into ["AB", "C"]. If your data annotates a span describing the character offsets for "A", this will be a problem, because there isn't a token for "A" that can be labelled.
So it comes down to finding those cases in your data so you can see what the main problems are. I've posted a script on this thread that you can use to find mismatched character offsets programmatically:
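In case it's helpful, here's a minimal sketch of the idea (not necessarily identical to that script): `Doc.char_span` returns `None` if the character offsets don't line up with token boundaries, so you can loop over your examples and report the spans that come back as `None`. The example data below is made up.

```python
import spacy

nlp = spacy.blank("en")  # or the pipeline whose tokenizer you're training with

# Made-up examples in the usual (text, {"entities": [(start, end, label)]}) format
TRAIN_DATA = [
    ("AB C", {"entities": [(0, 1, "LABEL")]}),  # span "A" doesn't match a token
    ("AB C", {"entities": [(0, 2, "LABEL")]}),  # span "AB" aligns fine
]

for text, annotations in TRAIN_DATA:
    doc = nlp(text)
    for start, end, label in annotations["entities"]:
        # char_span returns None if (start, end) doesn't map onto token boundaries
        if doc.char_span(start, end, label=label) is None:
            print(f"Misaligned span {text[start:end]!r} ({start}, {end}) in {text!r}")
            print("Tokens:", [token.text for token in doc])
```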
Sometimes, the problem can be caused by basic preprocessing issues such as leading/trailing whitespace, or even off-by-one errors in the character offsets (if you've generated them programmatically). Those can usually be fixed pretty easily in a preprocessing step.
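For example, trimming whitespace off the annotated spans could be as simple as something like this (just a sketch, assuming your annotations store (start, end) character offsets):

```python
def strip_span_whitespace(text, start, end):
    """Move span boundaries inwards past leading/trailing whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end

# A span annotated as " C" in "AB C" becomes "C"
print(strip_span_whitespace("AB C", 2, 4))  # (3, 4)
```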
In other cases, it can point to tokenization settings that you might want to adjust to better fit the data you're working with. For instance, the tokenizer preserves a string like "hello.world" as a single token ["hello.world"], but you might want to split it into 3 tokens ["hello", ".", "world"]. If there are common patterns like this in your data, you can adjust the tokenization rules to be stricter and split more by modifying the punctuation rules (see the sketch below): Linguistic Features · spaCy Usage Documentation
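Here's a rough sketch of what that could look like for the "hello.world" case, adding an extra infix rule so a "." between two letters becomes its own token. The exact regex is just an illustration, so tailor it to the patterns in your data.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# By default, "hello.world" comes out as a single token: ['hello.world']
# Add an extra infix pattern that splits on "." between two letters.
infixes = list(nlp.Defaults.infixes) + [r"(?<=[A-Za-z])\.(?=[A-Za-z])"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer
# Depending on your pipeline, strings like this can also be caught by the
# tokenizer's URL matching, so it's switched off here for the sketch.
nlp.tokenizer.url_match = None

print([token.text for token in nlp("hello.world")])  # ['hello', '.', 'world']
```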
In general, it's good to tackle this early on and make sure you have a tokenizer in place that produces tokens that match the spans you're looking to annotate and predict – otherwise, you can easily end up with data that your model can't learn from.