Mismatched tokenization

hi @ttandel!

It looks like the character offsets in your annotations refer to spans of text that don't line up with the token boundaries produced by the tokenizer (`blank:en`) you're using in `ner.manual`.

What tokenizer did you use to get the pre-set spans?

Any chance you could redo the pre-set spans using the same tokenizer as `ner.manual` (i.e., `blank:en`)?

This would be the easiest fix, but I suspect it may not be possible. There are quite a few (~39) posts with the keyword "mismatched tokenization" that can help.

I couldn't go through all of them, but this one gives you a snippet of code to identify which spans are mismatched:
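In case it helps while you look through those threads, here is a minimal sketch of that kind of check (this is not the snippet from the linked post). It assumes your spans are dicts with `start`/`end` character offsets, as in Prodigy's task format, and that you have the tokenizer's `(start, end)` offsets for the same text. The name `find_mismatched_spans` and the example data are made up for illustration.

```python
def find_mismatched_spans(spans, token_offsets):
    """Return the spans whose offsets do not fall on token boundaries."""
    starts = {start for start, _ in token_offsets}
    ends = {end for _, end in token_offsets}
    return [
        span for span in spans
        if span["start"] not in starts or span["end"] not in ends
    ]


# Hand-written example data (hypothetical). With spaCy, the offsets could
# be computed as [(t.idx, t.idx + len(t)) for t in nlp(text)], where
# nlp = spacy.blank("en").
text = "I live in New York."
token_offsets = [(0, 1), (2, 6), (7, 9), (10, 13), (14, 18), (18, 19)]
spans = [
    {"start": 10, "end": 18, "label": "GPE"},  # "New York": aligned
    {"start": 9, "end": 18, "label": "GPE"},   # " New York": leading space
]

mismatched = find_mismatched_spans(spans, token_offsets)
for span in mismatched:
    print(repr(text[span["start"]:span["end"]]), span)
```

Printing the offending text alongside each span usually makes the cause (stray whitespace, punctuation, a split inside a word) easy to spot.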

Similarly, as this post discusses, the best course of action depends on what kind of mismatches you have:
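As one illustration only: if the mismatches turn out to be stray whitespace or spans that cut a token in half, one possible repair is to snap each span to the tokens it overlaps. The sketch below is my own assumption about what might fit, not something taken from the linked post, and `snap_span_to_tokens` is a hypothetical helper. spaCy's `Doc.char_span` has an `alignment_mode` argument (`"strict"`, `"contract"`, `"expand"`) that serves a similar purpose.

```python
def snap_span_to_tokens(span, token_offsets):
    """Expand or trim a span so it covers exactly the tokens it overlaps.

    Returns a new span dict, or None if the span overlaps no token
    (for example, a span that only covers whitespace).
    """
    overlapping = [
        (start, end) for start, end in token_offsets
        if start < span["end"] and end > span["start"]
    ]
    if not overlapping:
        return None
    fixed = dict(span)  # copy so the original annotation is left untouched
    fixed["start"] = overlapping[0][0]
    fixed["end"] = overlapping[-1][1]
    return fixed


# Hand-written token offsets for "I live in New York." (hypothetical data).
token_offsets = [(0, 1), (2, 6), (7, 9), (10, 13), (14, 18), (18, 19)]

leading_space = {"start": 9, "end": 18, "label": "GPE"}  # " New York"
cut_token = {"start": 10, "end": 17, "label": "GPE"}     # "New Yor"

print(snap_span_to_tokens(leading_space, token_offsets))
print(snap_span_to_tokens(cut_token, token_offsets))
```

Snapping like this changes the annotated text, so it is worth reviewing the repaired spans before trusting them. If the mismatches come from the tokenizer splitting differently (e.g., on hyphens), adjusting the tokenizer may be the better route.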

See if you can use these posts to help and feel free to post back questions if you run into issues (or let us know if you're able to fix it!)

Also, a small ask: going forward, please avoid posting images of code and/or output. Pasting it as a markdown code block instead lets us copy/paste it (especially code) and also makes it searchable, which images aren't. Thank you :slight_smile: