Double-spaces preventing manual span annotations

Hi folks!

Just started with Prodigy, and I've been using the spancat_food_ingredients spaCy project as a scaffold for learning.

I noticed there were a lot of examples where I wasn't able to annotate the full span I wanted: in the web interface, Prodigy showed a gap between some words (and often dropped to a new line), and any span that included that gap wouldn't be kept (you can highlight it, but when you release the mouse, nothing happens). Hovering over these gaps showed a red-circle-with-slash icon.

Looking at the data in the project, it seemed to mostly correlate with double and triple spaces: when I removed those from the data, almost all of the span-interrupting gaps disappeared. It seems like Prodigy's span annotation may have trouble with repeated whitespace characters.

Looking to learn how to prevent my real annotation tasks from having similar issues:

  • Is this expected/known behaviour for Prodigy?
  • Are there configuration options or suggested workflows that would let those spans still be annotated, or is this more of a data-cleaning problem?
  • Is there a known or suggested list of character combinations that might cause similar issues?

Thanks for your help.

Hi and welcome! :waving_hand:

I think what you're seeing is the default behaviour that prevents newlines from being annotated as part of a span, because in many cases, you don't want spans to include whitespace. (It's an easy annotation mistake to make that's often tricky to spot but can have a significant impact on the model.) But you can turn this off by setting "allow_newline_highlight": true in your prodigy.json (also see here for available interface settings).
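
For example, a minimal prodigy.json overriding just that one setting might look like this (merge it into whatever other settings you already have; everything else in your config stays unchanged):

```json
{
  "allow_newline_highlight": true
}
```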

(Btw, under the hood, what makes tokens unselectable is setting "disabled": true in the JSON data – so if this is ever relevant, you can also use it to your advantage! For example, if there are certain characters, words or punctuation that you know can and should never be part of a span, you can make sure it's impossible to create invalid annotations. This type of automation can be super helpful for data quality :slightly_smiling_face:)
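
As a rough sketch (the text, token offsets and the choice to disable the trailing period here are purely illustrative), a pre-tokenized task with one unselectable token could look like this:

```json
{
  "text": "Add 2 cups flour.",
  "tokens": [
    {"text": "Add", "start": 0, "end": 3, "id": 0},
    {"text": "2", "start": 4, "end": 5, "id": 1},
    {"text": "cups", "start": 6, "end": 10, "id": 2},
    {"text": "flour", "start": 11, "end": 16, "id": 3},
    {"text": ".", "start": 16, "end": 17, "id": 4, "disabled": true}
  ]
}
```

Any span the annotator tries to draw over the disabled token simply won't be created – the same behaviour you saw with the whitespace gaps.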