Hi folks!
I just started with Prodigy, and I've been using the spancat_food_ingredients spaCy project as a scaffold for learning.
I noticed a lot of examples where I couldn't annotate the full span I wanted. In the web interface, Prodigy showed a gap between some words (often dropping to a new line), and any span that included that gap wasn't kept: you highlight the text, and when you let go, nothing happens. Hovering over these gaps showed a red-circle-with-slash icon.
Looking at the data in the project, the gaps seemed to mostly correlate with double and triple spaces: when I removed those from the data, almost all of the span-interrupting gaps went away. So it looks like Prodigy's span annotation may have trouble with repeated whitespace characters.
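For anyone who wants to try the same cleanup, here is a minimal sketch of that kind of whitespace normalization. It is not necessarily the exact code I ran, and it makes some assumptions: the data is a JSONL file with one record per line, the text lives in a "text" field, and the helper names (collapse_whitespace, clean_jsonl) are just made up for illustration.

```python
import json
import re

# Matches runs of two or more whitespace characters (spaces, tabs, newlines, ...).
# A single whitespace character on its own is left untouched.
_WS_RUN = re.compile(r"\s{2,}")


def collapse_whitespace(text: str) -> str:
    """Replace each run of repeated whitespace with one space and trim both ends."""
    return _WS_RUN.sub(" ", text).strip()


def clean_jsonl(in_path: str, out_path: str, field: str = "text") -> None:
    """Copy a JSONL file, collapsing repeated whitespace in one string field.

    Assumption: each non-empty line is a JSON object and `field` holds the raw text.
    """
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if isinstance(record.get(field), str):
                record[field] = collapse_whitespace(record[field])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Note that this changes character offsets, so it would need to run before annotating; any existing span annotations on the original text would no longer line up.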
I'd like to learn how to keep my real annotation tasks from running into similar issues:
- Is this expected/known behaviour for Prodigy?
- Are there configurations or suggested workflows that could be employed so that those spans could still be annotated, or is it more a data-cleaning problem?
- Is there a known or suggested list of character combinations that might cause similar issues?
Thanks for your help.