Text with '+' instead of spaces in parts of the text

I am trying to use Prodigy + spaCy for information retrieval on Spanish texts. The texts are internal notes written by customer service agents, who follow some annotation guidelines. Some of the agents, when summarizing the final offer to the customer, use a + sign instead of a space. Something like:

The challenge I am facing as a newbie is that PRODUCTA+PRODUCTB is treated as one single token, and I would like to be able to select PRODUCTA and PRODUCTB separately.

I have been checking the documentation and, if my understanding is right, I should somehow change how spaCy tokenizes by adding '+' as a separator. I want to be sure this is the right approach and that it will behave consistently.
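For reference, a minimal sketch of what that tokenizer change might look like. It assumes a blank Spanish pipeline (`spacy.blank("es")`) rather than the pipeline actually used in the project, and the `tokenize` helper is a hypothetical name used only for illustration. It appends a rule for a literal '+' to the default infix patterns:

```python
import spacy
from spacy.util import compile_infix_regex

# Assumption: a blank Spanish pipeline (tokenizer only, no trained model).
nlp = spacy.blank("es")

# Keep the default infix rules and add one that splits on a literal '+'.
infixes = list(nlp.Defaults.infixes) + [r"\+"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer


def tokenize(text):
    """Hypothetical helper: return the token texts for a string."""
    return [token.text for token in nlp(text)]
```

With this rule the '+' becomes a token of its own, so PRODUCTA and PRODUCTB can be selected separately. The same tokenizer would need to be used everywhere the texts are processed, otherwise token boundaries will not match.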

Thanks in advance

Hi Luis,

Would it be an option for you to run some preprocessing on the text to change any + sign into a space, even before you use spaCy and/or Prodigy? Then you wouldn't have to deal with custom tokenization, and you'd have cleaner input texts to begin with?
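As a rough illustration of that kind of preprocessing, here is a small Python sketch using only the standard library. The function name `replace_plus_with_space` is made up for this example, and it assumes that only a '+' sitting directly between two non-space characters should be replaced:

```python
import re

# Matches a '+' directly between two non-space characters, so a
# stand-alone '+' that already has spaces around it is left untouched.
PLUS_BETWEEN_WORDS = re.compile(r"(?<=\S)\+(?=\S)")


def replace_plus_with_space(text: str) -> str:
    """Replace each '+' that glues two words together with a space."""
    return PLUS_BETWEEN_WORDS.sub(" ", text)
```

Running the texts through something like this before loading them into spaCy or Prodigy means the default tokenizer can be kept as it is.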


Hi Sofie,
Many thanks for the swift reply. I thought about it but got tempted by complexity. :sweat_smile:
Now that you mention it, it is probably the best solution: it avoids further complications with custom tokenization, and it stays consistent even if agents change how they type.

For reference, this is the REGEXP pattern that I am using in the extraction (BigQuery). Note that the + has to be escaped, otherwise it is read as a quantifier:
REGEXP_REPLACE(DESCRIPTION, r'([(-z])\+([(-z])', r'\1 + \2') as text