Mismatched tokenization

Hi Team,
I am trying the ner.manual recipe on a dataset with pre-set spans. It works as expected for about 90% of the records, but the other 10% throw this exception:

"ValueError: Mismatched tokenization. Can't resolve span to token index 228. This can happen if your data contains pre-set spans. Make sure that the spans match spaCy's tokenization or add a 'tokens' property to your task"

The input text is as follows:
{"text":"Contract Price:\r\nFor each Calculation Period, the price of the LAST NYM TRADING DAY of each NYMEX Natural Gas Futures contract from and\r\nincluding the February 2022 contract to and including the February 2022 contract PLUS USD .4 per MMBTU.","spans":[{"start":63,"end":83,"label":"prcIndex"},{"start":92,"end":117,"label":"prcIndex"},{"start":195,"end":208,"label":"prcWindow"},{"start":218,"end":222,"label":"prcDiffSign"},{"start":223,"end":226,"label":"prcCCY"},{"start":228,"end":229,"label":"prcDiffAmt"},{"start":234,"end":239,"label":"prcUOM"}]}

Could you please point out where the issue is and what I should do to fix it?

Thanks,

hi @ttandel!

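It seems the character offsets in your annotations refer to spans of text that do not line up with the token boundaries produced by the tokenizer (blank:en) you're using in ner.manual. In the example you posted, the error points at character 228: the prcDiffAmt span (228 to 229) covers only the "4" of ".4", so if the tokenizer keeps ".4" as a single token, that span starts in the middle of a token.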

What tokenizer did you use to get the pre-set spans?

Any chance you could redo the pre-set spans using the same tokenizer as ner.manual (i.e., blank:en)?

This would be the easiest fix, but I suspect it may not be possible. There are a lot of posts (~39) with the keyword "mismatched tokenization" that can help.

I couldn't go through all of them, but this one gives you a snippet of code to identify which spans are mismatched:
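As a rough sketch of that kind of check (this is not the code from the linked post), the function below flags every span whose start or end does not fall on a token boundary. The name `find_mismatched_spans`, the shortened text, and the hand-written token offsets are assumptions made for illustration; in particular, the offsets assume that ".4" is kept as one token.

```python
def find_mismatched_spans(token_offsets, spans):
    """Return the spans whose start or end is not on a token boundary."""
    token_starts = {start for start, _ in token_offsets}
    token_ends = {end for _, end in token_offsets}
    return [
        span
        for span in spans
        if span["start"] not in token_starts or span["end"] not in token_ends
    ]


# Shortened, hypothetical version of the record from the question.
text = "PLUS USD .4 per MMBTU."

# Assumed tokenization as (start, end) character offsets:
# "PLUS", "USD", ".4", "per", "MMBTU", "."
token_offsets = [(0, 4), (5, 8), (9, 11), (12, 15), (16, 21), (21, 22)]

spans = [
    {"start": 5, "end": 8, "label": "prcCCY"},       # "USD": aligned
    {"start": 10, "end": 11, "label": "prcDiffAmt"},  # "4": starts mid-token
]

mismatched = find_mismatched_spans(token_offsets, spans)
for span in mismatched:
    print(span, repr(text[span["start"]:span["end"]]))
```

With spaCy, the real offsets could be collected from the tokenizer, for example `[(t.idx, t.idx + len(t)) for t in nlp.make_doc(text)]` with `nlp = spacy.blank("en")`, and the check run over every record in the dataset.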

Similarly, as this post discusses, the best course of action depends on what the mismatches are:
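One possible fix, when a span only cuts into a token (like a "4" inside ".4"), is to widen it to the nearest token boundaries. Here is a minimal sketch of that idea; `expand_to_tokens` and the token offsets are hypothetical, and whether widening is acceptable depends on your labels (here the amount would become ".4" rather than "4").

```python
def expand_to_tokens(token_offsets, start, end):
    """Widen (start, end) so that both ends land on token boundaries.

    token_offsets must be sorted (start, end) character offsets.
    Returns None when the span does not overlap any token.
    """
    touched = [(s, e) for s, e in token_offsets if s < end and e > start]
    if not touched:
        return None
    return touched[0][0], touched[-1][1]


# Assumed tokenization of "PLUS USD .4 per MMBTU." with ".4" as one token.
token_offsets = [(0, 4), (5, 8), (9, 11), (12, 15), (16, 21), (21, 22)]

print(expand_to_tokens(token_offsets, 10, 11))  # (9, 11)
```

spaCy offers a similar option through `doc.char_span(start, end, alignment_mode="expand")`. If widening would change the meaning of the annotation, the alternative mentioned in the error message is to add a "tokens" property to the task so that the token boundaries match your spans.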

See if you can use these posts to help and feel free to post back questions if you run into issues (or let us know if you're able to fix it!)

Also, a small ask: going forward, please avoid posting images of code and/or output. Markdown code blocks are easier to work with, since they let us copy/paste (especially code) and make the content searchable, which images are not. Thank you :slight_smile: