Using spans to split tokens

Hi @adieyal,

Thanks for your post and welcome to the Prodigy community :wave:

The advice given in the post you mentioned is still relevant!

Annotating the text with spans is a good way to handle token splitting prior to entity extraction. It's especially useful when the patterns are too varied to cover reliably with regular expressions in a pre-processing step.

The idea is to create a separate model that predicts the pieces to split as entities, and then use this model to pre-process your text before it reaches your actual NER model. This way, the first model "corrects" the tokenization in a way that's more meaningful for your specific use case.
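To make the pre-processing step a bit more concrete, here's a minimal sketch in plain Python. It assumes your splitting model gives you character offsets for each piece; the function name `apply_splits` and the example offsets are made up for illustration and aren't part of Prodigy or spaCy. It simply inserts a space wherever a predicted boundary falls inside a token, so a whitespace-based tokenizer will split there afterwards.

```python
from typing import List, Tuple


def apply_splits(text: str, spans: List[Tuple[int, int]]) -> str:
    """Insert a space at every span boundary that falls inside a token.

    `spans` are (start, end) character offsets, e.g. taken from the
    entities predicted by a (hypothetical) splitting model.
    """
    # Every start and end offset is a potential cut point.
    cuts = set()
    for start, end in spans:
        cuts.add(start)
        cuts.add(end)

    out = []
    for i, ch in enumerate(text):
        # Only add a space if there is no whitespace on either side yet.
        if i in cuts and i > 0 and not text[i - 1].isspace() and not ch.isspace():
            out.append(" ")
        out.append(ch)
    return "".join(out)


# Hypothetical prediction: "50" and "mg" are separate pieces.
cleaned = apply_splits("Take 50mg daily", [(5, 7), (7, 9)])
print(cleaned)
```

One thing to keep in mind: inserting spaces shifts the character offsets, so any entities your NER model predicts afterwards refer to the cleaned text, not the original string.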

Remember to train this model with examples of what you do and don't want to split, so it can learn to predict the correct boundaries.
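As a rough illustration of what such training data could look like, the sketch below builds two examples using the `text` and `spans` (`start`, `end`, `label`) keys that Prodigy tasks use. The label name `PART`, the helper `make_example` and the example sentences are assumptions for illustration only. The second example has no spans, which is how you'd show the model a case it should leave alone.

```python
import json


def make_example(text, spans):
    """Build one task dict from (start, end, label) character offsets."""
    return {
        "text": text,
        "spans": [{"start": s, "end": e, "label": label} for s, e, label in spans],
    }


examples = [
    # Should be split: "50" and "mg" are annotated as separate pieces.
    make_example("Take 50mg daily", [(5, 7, "PART"), (7, 9, "PART")]),
    # Should NOT be split: "B12" stays one token, so there are no spans.
    make_example("Order B12 today", []),
]

# One JSON object per line (JSONL), ready to write to a file.
jsonl = "\n".join(json.dumps(eg) for eg in examples)
print(jsonl)
```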

Also check out the highlight characters docs. We recently added a "toggle" to turn on character highlighting too.

Hope this helps!