I recently did some labelling of a dataset in another labelling tool and now want to port it all over to Prodigy for the greater control it offers. I'm unfortunately having some issues. The output from my previous labelling tool gave me character-level start/end offsets for each span. I've converted those spans to JSONL, but am predictably running into mismatched tokenization when importing into Prodigy.
Here, Ines suggests using skip=True in the add_tokens preprocessor. Does this mean I have to create a custom recipe (instead of ner.manual) just to skip entities that fail tokenization?
Secondly, Ines suggests a way to check which spans don't align properly with spaCy's tokenization. In these scenarios, what can I actually do to fix the issue? Just eliminate them from my dataset?
You could do this in a custom recipe, or alternatively, the quick-and-dirty solution would be to edit the recipe included with Prodigy. You can run prodigy stats to find the location of your Prodigy installation and then edit recipes/ner.py.
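Here's a minimal sketch of what the custom-recipe approach could look like, assuming your data is in JSONL. The recipe name, argument names and label handling are just illustrative, but the key part is passing skip=True to the add_tokens preprocessor:

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe(
    "ner.manual.skip",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model (used for tokenization)", "positional", None, str),
    source=("Path to JSONL file with pre-labelled spans", "positional", None, str),
    label=("Comma-separated label set", "option", "l", str),
)
def ner_manual_skip(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)
    stream = JSONL(source)
    # skip=True drops spans that can't be mapped onto token
    # boundaries instead of raising a mismatch error
    stream = add_tokens(nlp, stream, skip=True)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": label.split(",") if label else []},
    }
```

You'd then run it with the -F flag pointing to the file, e.g. prodigy ner.manual.skip my_dataset en_core_web_sm ./annotations.jsonl --label PERSON,ORG -F recipe.py.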
One option would be to just exclude the spans, but it might also be worth looking at the affected spans in more detail to see if there's a common pattern that you can fix easily (and programmatically). Common problems are things like off-by-one errors, or whitespace like leading or trailing spaces mistakenly included in the span. Or maybe it turns out that your data includes certain markup or unusual punctuation, in which case it can make sense to add a custom tokenization rule that splits more aggressively on those characters, so you end up with tokenization that better matches the spans you're looking for.
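For example, here's a rough sketch of that kind of check (the file path is a placeholder). It uses spaCy's doc.char_span, which returns None if the character offsets don't map onto token boundaries, and tries trimming stray whitespace before giving up on a span:

```python
import json
import spacy

nlp = spacy.blank("en")  # or the model you'll use for annotation


def trim_whitespace(text, start, end):
    """Strip leading/trailing whitespace mistakenly included in a span."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


total = misaligned = fixed = 0
with open("annotations.jsonl", encoding="utf8") as f:  # placeholder path
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            total += 1
            # char_span returns None if the offsets don't align with tokens
            if doc.char_span(span["start"], span["end"]) is not None:
                continue
            misaligned += 1
            start, end = trim_whitespace(eg["text"], span["start"], span["end"])
            if doc.char_span(start, end) is not None:
                # fixable in place by trimming whitespace
                span["start"], span["end"] = start, end
                fixed += 1
            else:
                print(f"Still misaligned: {eg['text'][start:end]!r} ({start}, {end})")

print(f"{misaligned}/{total} spans misaligned, {fixed} fixed by trimming whitespace")
```

The spans that are still misaligned after a pass like this are the ones worth inspecting for a tokenization pattern you could handle with a custom rule.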
I ran your code to show the affected spans. Less than 0.1% of my labels were problematic, so I've removed them from my dataset. Thank you for the great advice here!
Now I just need to figure out why my Prodigy process keeps getting terminated by Ubuntu overnight.