Wrong tokenization on commas preceded by a special character

Hi!
I've noticed that when using the default tokenizer (from the en_core_web_sm model), I get incorrect tokenization in sentences like the following:

Required languages: java,c#,javascript.

This gives the following tokens: Required languages : java , c#,javascript .

This does not happen with the following case:

Required languages: java,python,javascript.

Tokens: Required languages : java , python , javascript .

The problem seems to occur with any comma preceded by a special character, e.g.:

java,c++,javascript -> java , c++,javascript
java,F#,javascript -> java , F#,javascript
java,XyZ@,javascript -> java , XyZ@,javascript

I've tried to add special cases to the tokenizer but it didn't make any difference:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.add_special_case(u"C#", [{ORTH: u"C#"}])
nlp.tokenizer.add_special_case(u"c#", [{ORTH: u"c#"}])
doc = nlp("Required languages: java,c#,javascript.")
print([t.text for t in doc])

Any advice on how to fix this tokenization issue?

Hi! I think what you want to do here is add the , to the infixes, since you want the tokenizer to additionally split on commas within a string. (Your special case rules only tell it to preserve those strings as single tokens, which it already does.) I just tried it out locally and the following worked for me:

infixes = nlp.Defaults.infixes + (r",",)
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print([t.text for t in nlp("Required languages: java,python,javascript.")])
# ['Required', 'languages', ':', 'java', ',', 'python', ',', 'javascript', '.']
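
For context, the reason the default tokenizer leaves c#,javascript together: the built-in comma infix rule only fires when the comma sits between two alphabetic characters, and # isn't one. A simplified sketch of that rule (the real pattern uses spaCy's full ALPHA character class, not just A-Za-z):

import re

# Simplified stand-in for the default English comma infix rule
comma_infix = re.compile(r"(?<=[A-Za-z]),(?=[A-Za-z])")
print(bool(comma_infix.search("java,python")))    # True  -> comma gets split
print(bool(comma_infix.search("c#,javascript")))  # False -> comma stays attached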

Thanks! I've tried your solution, and it now splits correctly on the commas, but the tokenization is still incorrect because it treats the '#' character as its own token.

Following the documentation, I've tried adding a token_match pattern for programming languages like C# and F# that end with a # character, in order to override the punctuation rules, but it still has no effect:

[Screenshot: tokenizer output still showing '#' split off as a separate token]
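
For reference, a minimal sketch of the kind of token_match override I attempted (the exact regex here is an assumption, not the one from my code):

import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Comma added to the infixes, as suggested above
infixes = nlp.Defaults.infixes + (r",",)
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

# Hypothetical pattern for language names ending in '#', e.g. c#, F#
hash_lang = re.compile(r"^[a-zA-Z]+#$")
nlp.tokenizer.token_match = hash_lang.match

print([t.text for t in nlp("Required languages: java,c#,javascript.")])
# '#' still ends up as its own token, as in the screenshot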

What am I missing?

I guess you could remove the # from the infixes?

infixes = nlp.Defaults.infixes + (r",",)
# Drop the bare "#" pattern so '#' is no longer treated as an infix
infixes = tuple(i for i in infixes if i != "#")
infix_regex = spacy.util.compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infix_regex.finditer
print([t.text for t in nlp("Required languages: java,python,c#,javascript")])
# ['Required', 'languages', ':', 'java', ',', 'python', ',', 'c#', ',', 'javascript']

That did the trick, thanks Ines!

Just out of curiosity: why are we loading the default suffixes as the new infixes instead of doing something like:

infixes = nlp.Defaults.infixes + (r",",)

The above didn't work for me; only the version with nlp.Defaults.suffixes did. Just wondering why :grin:

Thank you again for the help!

Sorry, that was a typo! :woman_facepalming: I just fixed it above.
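
By the way, both nlp.Defaults.infixes and nlp.Defaults.suffixes are just tuples of regex strings, which is why the version with the typo still compiled and ran; it just fed the tokenizer a different set of patterns. A quick way to see that (sketch):

import spacy

nlp = spacy.load("en_core_web_sm")
# Both defaults are plain tuples of regex strings, so either one
# compiles as an infix regex without error; only the patterns differ.
print(type(nlp.Defaults.infixes), type(nlp.Defaults.suffixes))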