Using spans to split tokens

adieyal · November 29, 2023, 6:44am

Hi

Thank you for an excellent tool and an amazing support forum. I would appreciate advice on splitting tokens prior to entity extraction. Here is my sample input:

LA MOLISANA 28 FUSILLI 500GCTN

I would like to split it as follows:

LA MOLISANA 28 FUSILLI 500 G CTN

The patterns are quite varied and pre-processing using regular expressions can be fiddly since there are many instances where a semantic understanding of the text is needed to split correctly. Is annotating the text with spans the right way to do it?

I've read through this post which I think discusses the same requirement. It was written in 2021 and I wanted to check if the advice in the post was still relevant.

ryanwesslen · November 29, 2023, 8:55pm

Hi @adieyal,

Thanks for your post and welcome to the Prodigy community!

The advice given in the post you mentioned still looks relevant!

Annotating the text with spans is a good approach to handle the splitting of tokens prior to entity extraction. This is especially useful when the patterns are varied and pre-processing using regular expressions can be challenging.

The idea is to create a new model that predicts the splits as entities, and then use this model to pre-process your text before you use it with your actual NER model. This way, you can use the model to "correct" the tokenization in a way that's more meaningful for your specific use-case.
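As a rough sketch of that pre-processing step, here is how predicted spans could be turned back into split text. It uses plain Python only and assumes the span model's output is already available as (start, end) character offsets; the `split_on_spans` helper and the hard-coded offsets are made up for illustration and are not part of Prodigy or spaCy.

```python
def split_on_spans(text, spans):
    """Insert a space at every span boundary that falls inside a token.

    `spans` is a list of (start, end) character offsets, e.g. the
    predictions of a separately trained "split" model (hypothetical here).
    """
    boundaries = set()
    for start, end in spans:
        boundaries.update((start, end))

    pieces = []
    for i, ch in enumerate(text):
        # Only add a space when the boundary sits between two
        # non-whitespace characters, so existing spaces are not doubled.
        if (
            i in boundaries
            and i > 0
            and not text[i - 1].isspace()
            and not ch.isspace()
        ):
            pieces.append(" ")
        pieces.append(ch)
    return "".join(pieces)


text = "LA MOLISANA 28 FUSILLI 500GCTN"
# Hand-written stand-ins for model predictions: "500", "G", "CTN"
predicted_spans = [(23, 26), (26, 27), (27, 30)]

cleaned = split_on_spans(text, predicted_spans)
print(cleaned)  # LA MOLISANA 28 FUSILLI 500 G CTN
```

The cleaned string is what you would then pass to your actual NER model. Note that the character offsets shift after splitting, so if you need to map entities back to the original text you would have to keep track of the inserted spaces.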

Remember to train this model with examples of what you do and don't want to split, so it can learn to predict the correct boundaries.
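To make that concrete, here is a sketch of what such training examples might look like as JSONL with "text" and "spans" keys holding character offsets. The `make_task` helper, the "PIECE" label and the second product string are all invented for illustration; the second example has no spans, to show a token ("7UP") that should stay intact.

```python
import json


def make_task(text, pieces, label="PIECE"):
    """Build one annotation example; `pieces` are substrings to mark, in order.

    The helper and the "PIECE" label are hypothetical, not a Prodigy API.
    """
    spans = []
    cursor = 0
    for piece in pieces:
        # Search from the end of the previous piece so repeats resolve in order.
        start = text.index(piece, cursor)
        end = start + len(piece)
        spans.append({"start": start, "end": end, "label": label})
        cursor = end
    return {"text": text, "spans": spans}


tasks = [
    # Positive example: "500GCTN" should be split into three pieces.
    make_task("LA MOLISANA 28 FUSILLI 500GCTN", ["500", "G", "CTN"]),
    # Negative example (invented): nothing here should be split.
    make_task("7UP LEMON 330 ML CAN", []),
]

# One JSON object per line, i.e. JSONL.
jsonl = "\n".join(json.dumps(task) for task in tasks)
print(jsonl)
```

Including examples with an empty "spans" list is one way to show the model cases where the existing tokenization is already what you want.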

Also check out the highlight characters docs. We recently added a "toggle" to turn on character highlighting too.

Hope this helps!

adieyal · November 30, 2023, 6:06am

It does, thanks. The highlight characters feature is very useful.
