In my corpus, I’m seeing a lot of tokens for MONEY which look like the following:

$8/hour
I’d like to split the text into ['$', '8', '/hour'], tag only the first two tokens as MONEY, and train a new token class for the '/hour' part. What is the best way to tell the spaCy tokenizer to split '8/hour' when the suffix could be '/day', '/week', '/job', etc.?
Just a quick note on your solution, though: if you create the tokenizer this way, you won’t be loading the normal language-specific rules, so your tokenization will be worse overall.
What you probably want to do is modify the English.Defaults.infixes class attribute before loading the model: when the tokenizer is constructed, this attribute is read and compiled into the infix regex. The English.Defaults.infixes attribute is a tuple of strings, which are built into regular expressions, so you concatenate another tuple rather than a list. Something like this should work:
from spacy.lang.en import English

# Match a slash followed by letters, e.g. /hour, /day, /week, /job.
# Use [A-Za-z] rather than [A-z], which would also match punctuation.
English.Defaults.infixes = English.Defaults.infixes + (r'/[A-Za-z]+',)
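To sanity-check that the pattern takes effect, here’s a minimal sketch, assuming spaCy v2, where a blank English() pipeline builds its tokenizer from the Defaults (the example sentence is made up):

from spacy.lang.en import English

English.Defaults.infixes = English.Defaults.infixes + (r'/[A-Za-z]+',)

# A blank pipeline constructs its tokenizer from the (now extended) Defaults.
nlp = English()
doc = nlp(u"The gig pays $8/hour or $50/job.")
print([t.text for t in doc])
# expected: ['The', 'gig', 'pays', '$', '8', '/hour', 'or', '$', '50', '/job', '.']

The '$' is already split off by the default prefix rules; the new infix pattern handles the '8/hour' split, and it generalizes to '/day', '/week', etc. without enumerating them.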