Hi!
I've noticed that when using the default tokenizer (from the en_core_web_sm
model) I get incorrect tokenization in sentences like the following:
Required languages: java,c#,javascript.
This gives the following tokens: ['Required', 'languages', ':', 'java', ',', 'c#,javascript', '.']
This does not happen with the following case:
Required languages: java,python,javascript.
Tokens: ['Required', 'languages', ':', 'java', ',', 'python', ',', 'javascript', '.']
The problem seems to occur with any comma immediately preceded by a special (non-alphabetic) character, e.g.:

java,c++,javascript  -> ['java', ',', 'c++,javascript']
java,F#,javascript   -> ['java', ',', 'F#,javascript']
java,XyZ@,javascript -> ['java', ',', 'XyZ@,javascript']
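In case it helps, here is the small debugging sketch I used to look into it (assuming spaCy v2.2+, where Tokenizer.explain is available); it reports which tokenizer rule produced each token, which should pinpoint the rule at fault:

import spacy

nlp = spacy.load("en_core_web_sm")
# explain() runs a slow debugging tokenizer and reports which rule
# (prefix, suffix, infix, special case, or token match) produced each token
for rule, token in nlp.tokenizer.explain("Required languages: java,c#,javascript."):
    print(rule, repr(token))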
I've tried adding special cases to the tokenizer, but it didn't make any difference:
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer.add_special_case(u"C#", [{ORTH: u"C#"}])
nlp.tokenizer.add_special_case(u"c#", [{ORTH: u"c#"}])
doc = nlp("Required languages: java,c#,javascript.")
print([t.text for t in doc])
# still prints: ['Required', 'languages', ':', 'java', ',', 'c#,javascript', '.']
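My guess is that the default infix rules only split a comma that sits between two alphabetic characters, so "c#,javascript" is kept as a single chunk and the "c#" special case never has a whole token to match against. If that's right, extending the infix patterns might work; here is a sketch of what I have in mind (the extra regex is my own assumption, not verified against the default rules):

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Assumption: add an infix rule that splits any comma sitting between
# two non-whitespace characters, on top of the model's default infixes
infixes = list(nlp.Defaults.infixes) + [r"(?<=\S),(?=\S)"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("Required languages: java,c#,javascript.")
print([t.text for t in doc])
# hoping for: ['Required', 'languages', ':', 'java', ',', 'c#', ',', 'javascript', '.']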
Is extending the infixes like this the right approach, or is there a better way to fix this tokenization issue?