Multiple issues with character based annotation

ines · July 22, 2021, 7:31am

Thanks, this is very helpful! I'll play with this in different browsers and see if I can reproduce it.

NER models typically predict token-based tags. What those tokens are may differ: sometimes they're linguistically motivated tokens (what you normally think of as a "word"), sometimes they're word chunks like the word piece tokens used by transformer models, which are segmented based on what's most efficient to embed. But you usually want to work with at least some type of token definition, which also makes it easier to use pretrained embeddings. That said, in some languages that don't really have the same concept of word = whitespace-delimited chunk (e.g. Chinese), it can make sense to work at the character level instead.

The character-based highlighting mostly exists because there are some use cases where you might want to highlight individual characters (e.g. specific character-based implementations or very different types of models that predict characters or segmentation). But it's not usually something we recommend if you're training a token-based model, because you'll easily end up with annotated spans that don't map to actual tokens and can't be predicted or embedded.
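To make the mismatch concrete, here's a minimal sketch in plain Python. It assumes a very simple regex tokenizer (words and punctuation), which is not what any particular library or model actually uses, and the `tokenize` and `align_span` helpers are hypothetical names made up for this illustration. The idea is just that a character span is only usable for a token-based model if its start and end offsets fall on token boundaries.

```python
import re


def tokenize(text):
    """Return (start, end) character offsets per token.

    Assumption: a toy tokenizer that splits into word and punctuation
    tokens. Real tokenizers will segment differently.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def align_span(text, start, end):
    """Map a character span to (first, last) token indices.

    Returns None if the span doesn't start and end on token boundaries.
    """
    tokens = tokenize(text)
    starts = {s: i for i, (s, _) in enumerate(tokens)}
    ends = {e: i for i, (_, e) in enumerate(tokens)}
    if start in starts and end in ends and starts[start] <= ends[end]:
        return starts[start], ends[end]
    return None


text = "Apple is opening an office in San Francisco."
start = text.index("San Francisco")

# Span covering exactly "San Francisco": lines up with two whole tokens.
aligned = align_span(text, start, start + len("San Francisco"))

# Span covering "San Francis": ends in the middle of a token.
misaligned = align_span(text, start, start + len("San Francis"))

print(aligned)
print(misaligned)
```

The second span is the kind of annotation that character-based highlighting makes easy to create by accident: it looks fine in the UI, but there's no sequence of tokens that a token-based model could tag to reproduce it.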
