Using spans to split tokens

Hi

Thank you for an excellent tool and an amazing support forum. I would appreciate advice regarding splitting tokens prior to entity extraction. Here is my sample input:

LA MOLISANA 28 FUSILLI 500GCTN

I would like to split it as follows:

LA MOLISANA 28 FUSILLI 500 G CTN

The patterns are quite varied and pre-processing using regular expressions can be fiddly since there are many instances where a semantic understanding of the text is needed to split correctly. Is annotating the text with spans the right way to do it?

I've read through this post which I think discusses the same requirement. It was written in 2021 and I wanted to check if the advice in the post was still relevant.

Hi @adieyal,

Thanks for your post and welcome to the Prodigy community :wave:

The advice given in the post you mentioned still looks relevant!

Annotating the text with spans is a good approach to handle the splitting of tokens prior to entity extraction. This is especially useful when the patterns are varied and pre-processing using regular expressions can be challenging.

The idea is to create a new model that predicts the splits as entities, and then use this model to pre-process your text before you use it with your actual NER model. This way, you can use the model to "correct" the tokenization in a way that's more meaningful for your specific use-case.
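As a rough illustration of that pre-processing step, here's a minimal sketch in plain Python. It assumes the splitting model has already produced character-offset spans for your sample input; the `split_on_spans` helper, the span offsets and the `QTY` / `UNIT` / `PACK` labels are all made up for this example and aren't part of Prodigy or spaCy.

```python
def split_on_spans(text, spans):
    """Insert a space at every span boundary that falls inside a token."""
    # Collect every character offset where a predicted span starts or ends.
    boundaries = set()
    for span in spans:
        boundaries.add(span["start"])
        boundaries.add(span["end"])

    pieces = []
    for i, char in enumerate(text):
        # Only add a space when the boundary sits in the middle of a token,
        # i.e. neither neighbouring character is already whitespace.
        if (
            i in boundaries
            and i > 0
            and not char.isspace()
            and not text[i - 1].isspace()
        ):
            pieces.append(" ")
        pieces.append(char)
    return "".join(pieces)


text = "LA MOLISANA 28 FUSILLI 500GCTN"

# Hypothetical model output: spans covering "500", "G" and "CTN".
spans = [
    {"start": 23, "end": 26, "label": "QTY"},
    {"start": 26, "end": 27, "label": "UNIT"},
    {"start": 27, "end": 30, "label": "PACK"},
]

print(split_on_spans(text, spans))
# LA MOLISANA 28 FUSILLI 500 G CTN
```

The output of a helper like this is what you'd then pass to your actual NER model, so the tokenizer sees `500`, `G` and `CTN` as separate tokens.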

Remember to train this model with examples of what you do and don't want to split, so it can learn to predict the correct boundaries.

Also check out the highlight characters docs. We recently added a "toggle" to turn on character highlighting too.

Hope this helps!

It does - thanks. The highlight characters feature is very useful.