Hi! If you're using Prodigy for manual span annotation, it will pre-tokenize the text so your selection can snap to token boundaries. For most token-based annotation tasks (NER, POS tags), this is nice because you don't have to hit the exact boundaries and can annotate much faster. It also lets you spot tokenization issues early, since you can't really train token-based models on annotations that don't map to tokens.
However, it does mean that you can't just select half a token. If words with missing spaces appear only occasionally, you could use a separate label for them (e.g. `MESSY_SUBWORDS`), highlight the tokens in question and annotate the rest, and then filter out all examples containing `MESSY_SUBWORDS` afterwards. You can then add the missing spaces, or add the character offsets of the subwords (depending on what you need).
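For the filtering step, a minimal sketch could look like this. It assumes you've exported your annotations with `db-out`, and the path `annotations.jsonl` is just a placeholder:

```python
import json

def has_messy_subwords(eg):
    # True if any annotated span uses the placeholder label
    return any(span["label"] == "MESSY_SUBWORDS" for span in eg.get("spans", []))

# "annotations.jsonl" is a hypothetical path to data exported via db-out
with open("annotations.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

clean = [eg for eg in examples if not has_messy_subwords(eg)]
messy = [eg for eg in examples if has_messy_subwords(eg)]
```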
You could also stream in only the messy span texts one by one, add a "tokens" field with one token per character and then highlight the individual subwords. This would show you something like f o r e x a m p l e l i k e t h i s – and you'd then highlight the characters that make up each subword, e.g. f o r as one span and e x a m p l e as the next.
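Here's a minimal sketch of that idea, assuming each incoming example's "text" holds just the messy span text (the "ws" flag tells the UI to render a space after each character):

```python
def add_char_tokens(stream):
    # one "token" per character, so each character becomes individually
    # selectable in the manual interface
    for eg in stream:
        eg["tokens"] = [
            {"text": ch, "start": i, "end": i + 1, "id": i, "ws": True}
            for i, ch in enumerate(eg["text"])
        ]
        yield eg

stream = add_char_tokens([{"text": "forexamplelikethis"}])
```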
If you want to train a model that predicts token-based tags from annotations that refer to partial tokens, those "subwords" need to be individual tokens. Of course, it's always nicer to do the segmentation programmatically if you can, but you can also use spaCy's retokenizer to split tokens.
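For the retokenizer route, a sketch along these lines should work (the merged word here is made up):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("forexample")  # tokenized as a single token with a missing space
assert len(doc) == 1

with doc.retokenize() as retokenizer:
    # the new orths have to concatenate to exactly the original token text;
    # with no dependency parse, we just attach both subtokens to the first
    retokenizer.split(doc[0], ["for", "example"], heads=[(doc[0], 0), (doc[0], 0)])

print([token.text for token in doc])  # ['for', 'example']
```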