I recently did some labelling of a dataset in another labelling tool and now want to port it all over to Prodigy for the greater control it offers. I'm unfortunately having some issues. The output from my previous labelling tool gave me character-level start/end offsets for each span. I've converted those spans to JSONL, but am predictably running into mismatched tokenization when importing into Prodigy.
Here, Ines suggests using skip=True in the add_tokens preprocessor. Does this mean I have to create a custom recipe (instead of ner.manual) just to skip entities that fail tokenization?
Secondly, Ines suggests a way to check which spans don't align properly with spaCy's tokenization. In these scenarios, what can I actually do to fix the issue? Just eliminate them from my dataset?
You could do this in a custom recipe, or alternatively, the quick-and-dirty solution would be to edit the recipe included with Prodigy. You can run prodigy stats to find the location of your Prodigy installation and then edit recipes/ner.py.
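Here's a minimal sketch of what the custom-recipe approach could look like, assuming your data is in JSONL. The recipe name, argument names and label handling are just illustrative, but the key part is passing skip=True to the add_tokens preprocessor:

```python
import prodigy
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.preprocess import add_tokens


@prodigy.recipe(
    "ner.manual.skip",  # hypothetical recipe name
    dataset=("Dataset to save annotations to", "positional", None, str),
    spacy_model=("Loadable spaCy model (used for tokenization)", "positional", None, str),
    source=("Path to JSONL file with pre-labelled spans", "positional", None, str),
    label=("Comma-separated label set", "option", "l", str),
)
def ner_manual_skip(dataset, spacy_model, source, label=None):
    nlp = spacy.load(spacy_model)
    stream = JSONL(source)
    # skip=True drops spans that can't be mapped onto token
    # boundaries instead of raising a mismatch error
    stream = add_tokens(nlp, stream, skip=True)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",
        "config": {"labels": label.split(",") if label else []},
    }
```

You'd then run it with the -F flag pointing to the file, e.g. prodigy ner.manual.skip my_dataset en_core_web_sm ./annotations.jsonl --label PERSON,ORG -F recipe.py.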
One option would be to just exclude the spans, but it might also be worth looking at the affected spans in more detail to see if there's a common pattern that you can fix easily (and programmatically). Common problems are things like off-by-one errors, or whitespace like leading or trailing spaces mistakenly included in the span. Or maybe it turns out that your data includes certain markup or unusual punctuation, in which case it can make sense to add a custom tokenization rule that splits more aggressively on those characters, so you end up with tokenization that better matches the spans you're looking for.
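For example, here's a rough sketch of that kind of check (the file path is a placeholder). It uses spaCy's doc.char_span, which returns None if the character offsets don't map onto token boundaries, and tries trimming stray whitespace before giving up on a span:

```python
import json
import spacy

nlp = spacy.blank("en")  # or the model you'll use for annotation


def trim_whitespace(text, start, end):
    """Strip leading/trailing whitespace mistakenly included in a span."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


total = misaligned = fixed = 0
with open("annotations.jsonl", encoding="utf8") as f:  # placeholder path
    for line in f:
        eg = json.loads(line)
        doc = nlp.make_doc(eg["text"])
        for span in eg.get("spans", []):
            total += 1
            # char_span returns None if the offsets don't align with tokens
            if doc.char_span(span["start"], span["end"]) is not None:
                continue
            misaligned += 1
            start, end = trim_whitespace(eg["text"], span["start"], span["end"])
            if doc.char_span(start, end) is not None:
                # fixable in place by trimming whitespace
                span["start"], span["end"] = start, end
                fixed += 1
            else:
                print(f"Still misaligned: {eg['text'][start:end]!r} ({start}, {end})")

print(f"{misaligned}/{total} spans misaligned, {fixed} fixed by trimming whitespace")
```

The spans that are still misaligned after a pass like this are the ones worth inspecting for a tokenization pattern you could handle with a custom rule.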
I ran your code to show the affected spans. Less than 0.1% of my labels were problematic, so I've removed them from my dataset. Thank you for the great advice here!
Now I just need to figure out why my Prodigy process keeps getting terminated by Ubuntu overnight.