Using spans to split tokens

ryanwesslen · November 29, 2023, 8:55pm

Thanks for your post, and welcome to the Prodigy community!

The advice in the post you mentioned still looks relevant!

Annotating the text with spans is a good way to handle token splitting before entity extraction. It's especially useful when the patterns vary so much that pre-processing with regular expressions becomes challenging.

The idea is to create a new model that predicts the splits as entities, and then run that model over your text as a pre-processing step before the text reaches your actual NER model. This way, the model "corrects" the tokenization in a way that's more meaningful for your specific use case.
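As a rough sketch of that pre-processing step (not an official Prodigy or spaCy recipe): assuming the split model gives you character offsets for each predicted piece, you could insert whitespace at those boundaries so the downstream tokenizer separates them. The function name and the hard-coded spans below are hypothetical; in practice the offsets would come from your own split model's predictions.

```python
def insert_split_boundaries(text, spans):
    """Insert a space at every span boundary that falls inside a token.

    `spans` is a list of (start, end) character offsets into `text`,
    assumed here to come from a separately trained "split" model.
    """
    # Collect all boundary offsets; the set drops duplicates from adjacent spans.
    boundaries = set()
    for start, end in spans:
        boundaries.add(start)
        boundaries.add(end)

    pieces = []
    prev = 0
    for offset in sorted(boundaries):
        # Boundaries at the very start or end of the text need no split.
        if offset <= 0 or offset >= len(text):
            continue
        # Skip boundaries that already sit next to whitespace.
        if text[offset - 1].isspace() or text[offset].isspace():
            continue
        pieces.append(text[prev:offset])
        prev = offset
    pieces.append(text[prev:])
    return " ".join(pieces)


# Hypothetical predictions: "500mg" and "Paracetamol" are separate pieces.
text = "Order 500mgParacetamol today"
predicted_spans = [(6, 11), (11, 22)]
cleaned = insert_split_boundaries(text, predicted_spans)
```

The cleaned text is what you would then pass to the actual NER model. Note that inserting spaces shifts character offsets, so any annotations made on the original text would need to be re-aligned.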

Remember to train this model with examples of what you do and don't want to split, so it can learn to predict the correct boundaries.
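To make that concrete, here is a small sketch of what such training data might look like in the `text` plus `spans` shape that Prodigy tasks use. The label name `PIECE` and the example sentences are made up for illustration, and treating an example with no spans as "nothing to split" is an assumption about how you set up your annotation.

```python
import json

SPLIT_LABEL = "PIECE"  # hypothetical label name for one piece of a fused token

examples = [
    # Something you DO want to split: two pieces inside one fused token.
    {
        "text": "Order 500mgParacetamol today",
        "spans": [
            {"start": 6, "end": 11, "label": SPLIT_LABEL},
            {"start": 11, "end": 22, "label": SPLIT_LABEL},
        ],
    },
    # Something you DON'T want to split: "B12" stays whole, so no spans.
    {"text": "Take vitamin B12 daily", "spans": []},
]


def to_jsonl(records):
    """Serialize records to JSONL, checking that span offsets are in range."""
    lines = []
    for record in records:
        for span in record["spans"]:
            if not 0 <= span["start"] < span["end"] <= len(record["text"]):
                raise ValueError(f"Span out of range: {span}")
        lines.append(json.dumps(record))
    return "\n".join(lines)


jsonl = to_jsonl(examples)
```

Including both kinds of examples gives the model evidence for where boundaries belong and where they don't.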

Also check out the docs on highlighting characters. We recently added a "toggle" to turn on character highlighting, too.

Hope this helps!
