Multiple issues with character-based annotation

Glad to see there are character-based options now, i.e. the --highlight-chars flag for ner.manual.
However, I encountered some issues during annotation and training.

  1. During annotation, if I select across an entire word without including the trailing space, the UI appears to show the whole word as selected. However, when I release the mouse, the last character is not actually selected.

  2. During annotation, if I want to select something that starts at the very first character, I sometimes select backwards (from the end of the span) because it's faster and doesn't require pixel-perfect precision. However, if I move past the first character and out of the drop-shadowed part of the annotation UI before releasing the mouse, the release is not registered.

  3. During annotation, if I first select one character and then reverse that selection without releasing the mouse, releasing it can produce a zero-character selection. This causes a permanent problem for the rest of the session, as there is no way to cancel or deselect that annotation without restarting.

  4. There is no --highlight-chars option for ner.teach or ner.correct. Can this be added to the standard recipes so I can train a character-based model with active learning?

I'm on Prodigy nightly v1.11.0a10, macOS ARM wheel, with Firefox 91.0b3.

Thanks for the report!

I'll look into that! I remember there being some inconsistencies between how Chrome and Firefox handle highlighting with trailing whitespace, but if I recall correctly, Firefox was actually the most consistent here. Does this only occur with characters that are followed by actual whitespace?

Do you have an example text that shows the problem? I've been trying to reproduce this but I can't seem to make the UI produce this type of output or zero-length spans :thinking:

That's going to be tricky to work around and I don't have a perfect solution for this at the moment. One idea could be to add some spacing around the main text container (.prodigy-spans) to give you more space for your cursor. You could also try this out by using global_css and adding some padding on the left.
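
Something along these lines in your prodigy.json should do it (the exact padding value here is just a guess to illustrate the idea, so adjust it to whatever feels comfortable):

```json
{
  "global_css": ".prodigy-spans { padding-left: 40px }"
}
```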

That's definitely a bug, thanks! Zero-character spans shouldn't be valid, so we'll fix this to prevent them from being locked in.

This wouldn't really work. If you're training a model that predicts token-based tags, e.g. a named entity recognizer, the spans you provide need to map to valid tokens. So it doesn't really make sense to annotate characters here because you'll very easily end up with spans that the model can't learn from.
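
To give a rough idea of what I mean by spans mapping to tokens, here's a quick sketch using spaCy's Doc.char_span, which returns None when the character offsets don't line up with token boundaries (the example text, offsets and label are just made up for illustration):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The quick brown fox")

# Offsets that line up with token boundaries produce a valid span
span = doc.char_span(4, 9, label="TEST")
print(span)  # quick

# Offsets that cut into the middle of a token can't be mapped to tokens
misaligned = doc.char_span(4, 7, label="TEST")
print(misaligned)  # None
```

If your annotations contain character spans like the second one, a token-based NER model has nothing it can learn from.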

Now that you say it, I had a hard time reproducing it too... until I found a way.
It seems to be related to whitespace: if you select the word plus the trailing whitespace, then reverse the selection over the whitespace so that only the word is selected, releasing the mouse will also deselect the last character.

It doesn't happen in Safari 15, so it's probably browser-related.

Zero-length spans seem to be reproducible by directly selecting whitespace. But I'm also able to reproduce it in the middle of a word without any whitespace; it's just harder to trigger. Maybe try different selection directions or speeds. It doesn't seem to be related to the text content, though.

So the NER model can't learn at the character level? I was hoping it could learn segmentation as well. If that's the case, I wonder what --highlight-chars' actual use case is?

Thanks, this is very helpful! I'll play with this in different browsers and see if I can reproduce it :+1:

NER models typically predict token-based tags. What those tokens are may differ – sometimes it's linguistically motivated tokens (what you normally think of as a "word"), sometimes it's word chunks like the word piece tokens used by transformer models, which are segmented based on what's most efficient to embed. But you usually want to work with at least some type of token definition, which also makes it easier to use pretrained embeddings. That said, in some languages that don't really have the same concept of a word as a whitespace-delimited chunk (e.g. Chinese), it can make sense to work at the character level instead.
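
To make that last point a bit more concrete, here's a rough sketch comparing a blank English and a blank Chinese pipeline in spaCy (this assumes the Chinese pipeline's default per-character segmentation):

```python
import spacy

# English: tokens are roughly word-like, whitespace-delimited chunks
nlp_en = spacy.blank("en")
print([t.text for t in nlp_en("New York City")])  # ['New', 'York', 'City']

# Chinese: the default segmenter produces one token per character,
# so character-level spans naturally line up with token boundaries
nlp_zh = spacy.blank("zh")
print([t.text for t in nlp_zh("北京大学")])  # ['北', '京', '大', '学']
```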

The character-based highlighting mostly exists because there are some use cases where you might want to highlight individual characters (e.g. specific character-based implementations or very different types of models that predict characters or segmentation). But it's not usually something we'd recommend if you're training a token-based model, because you'll easily end up with annotated spans that don't map to actual tokens and can't be predicted or embedded.