Text with '+' instead of spaces in parts of the text

sorriluis · April 25, 2021, 4:54pm

I am trying to use Prodigy + spaCy for information retrieval on Spanish texts. The texts are internal annotations written by Customer Service Agents and follow some loose annotation guidelines. Some of the agents, when summarizing the final offer to the customer, use a + sign instead of a space. Something like:
FINAL OFFER: PRODUCTA+PRODUCTB+20% DISCOUNT+12 MONTHS ADDITIONAL SERVICE FINAL PRICE:23,20€

The challenge that I am facing as a newbie is that PRODUCTA+PRODUCTB is treated as one single token, and I would like to be able to select PRODUCTA and PRODUCTB separately.

I have been checking the documentation, and if my understanding is right, I should somehow change how spaCy tokenizes so that it also splits on '+'. I want to be sure this is the right approach, and that the tokenization will stay consistent between annotation and training.

Thanks in advance

SofieVL · April 25, 2021, 5:21pm

Hi Luis,

Would it be an option for you to run some preprocessing on the text to change any + sign into a space, even before you use spaCy and/or Prodigy? Then you wouldn't have to deal with custom tokenization, and you'd have cleaner input texts to begin with?
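
The preprocessing step suggested above could look something like the following. This is only a minimal sketch in Python using the standard `re` module; the function name `replace_plus_separators` is hypothetical, and it assumes a '+' should be replaced only when it sits directly between two non-space characters.

```python
import re


def replace_plus_separators(text: str) -> str:
    """Replace each '+' that joins two non-space characters with a space.

    Hypothetical helper for illustration only. A '+' that already has
    whitespace on either side is left untouched.
    """
    return re.sub(r"(?<=\S)\+(?=\S)", " ", text)


sample = (
    "FINAL OFFER: PRODUCTA+PRODUCTB+20% DISCOUNT+12 MONTHS "
    "ADDITIONAL SERVICE FINAL PRICE:23,20€"
)
cleaned = replace_plus_separators(sample)
print(cleaned)
```

Because each '+' is swapped for exactly one space, the cleaned text has the same length as the original, so character offsets still line up with the raw data if that matters later.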

sorriluis · April 25, 2021, 6:02pm

Hi Sofie,
many thanks for the swift reply. I thought about it but got tempted by complexity.
Now that you mention it, it is probably the best solution: it avoids further complications with custom tokenization, and it will keep the texts consistent even if agents change how they type.

For reference, this is the REGEXP pattern that I am using in the extraction (BigQuery). Note that the + has to be escaped so it is matched literally:
REGEXP_REPLACE(DESCRIPTION, r'([(-z])\+([(-z])', r'\1 + \2') as text

Thanks,
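
For anyone doing the same cleanup outside BigQuery, a rough Python counterpart of the expression above might look like this. It is a sketch only: it assumes the '+' is meant to be matched literally (escaped as `\+`), and the names `PLUS_BETWEEN_CHARS` and `pad_plus` are made up for the example.

```python
import re

# Assumption: the '+' is a literal plus sign, so it is escaped.
# The class [(-z] covers the ASCII range from '(' to 'z'
# (digits, upper- and lower-case letters, and some punctuation).
PLUS_BETWEEN_CHARS = re.compile(r"([(-z])\+([(-z])")


def pad_plus(text: str) -> str:
    """Put a space on each side of a '+' that sits between two matching characters."""
    return PLUS_BETWEEN_CHARS.sub(r"\1 + \2", text)


padded = pad_plus("PRODUCTA+PRODUCTB+20% DISCOUNT+12 MONTHS")
print(padded)
```

One thing to watch with Spanish text: accented letters such as É or Ñ fall outside the `[(-z]` range, so a '+' directly next to one of them (for example in `CAFÉ+TÉ`) is left unchanged by this pattern.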
