If anyone has the same question, here’s how I solved it:
spaCy's Tokenizer() constructor has an infix_finditer argument, which takes a function (e.g. a compiled pattern's .finditer method) that finds the infixes every token gets split on internally.
>>> import re
>>> import spacy
>>> from spacy.tokenizer import Tokenizer
>>> nlp = spacy.blank('en')
>>> infix_re = re.compile(r'/')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour')]
[$8, /, hour]
Edit: updated so that dates in YYYY/MM/DD format are not split:
>>> suffix_re = spacy.util.compile_suffix_regex(list(nlp.Defaults.suffixes) + [r'/[A-Za-z]+'])
>>> tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [w for w in tokenizer('$80/wk starting 2018/06/25')]
[$80, /wk, starting, 2018/06/25]
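As a variation (just a sketch using the pieces already set up above, not what I ended up using): if you'd rather have / come out as its own token, like in the first example, while still keeping dates whole, you can put the digit check into the infix pattern itself with lookarounds:
>>> # split on '/' unless it has digits on both sides, so 2018/06/25 stays intact
>>> infix_re = re.compile(r'(?<!\d)/|/(?!\d)')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour starting 2018/06/25')]
[$8, /, hour, starting, 2018/06/25]
The practical difference from the suffix approach is that something like /wk becomes two tokens (/ and wk) instead of a single /wk token, so pick whichever fits your downstream processing.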