Guidance on how to add a tokenizer rule

If anyone has the same question, here’s how I solved it:

spaCy's Tokenizer() constructor has an infix_finditer argument, which takes a compiled regex's finditer method and splits every token internally on the matches.

>>> import re
>>> import spacy
>>> from spacy.tokenizer import Tokenizer
>>> nlp = spacy.blank('en')  # a blank English pipeline is enough for this
>>> infix_re = re.compile(r'/')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour')]
[$8, /, hour]

Edit: updated so it doesn't split dates in YYYY/MM/DD format:

>>> suffix_re = spacy.util.compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [r'/[A-Za-z]+']))
>>> tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [w for w in tokenizer('$80/wk starting 2018/06/25')]
[$80, /wk, starting, 2018/06/25]
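
If you want this behaviour in a whole pipeline rather than a standalone Tokenizer, you can assign the custom tokenizer to nlp.tokenizer. A minimal sketch, assuming the same blank English pipeline and suffix_re from the example above:

>>> nlp.tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [t.text for t in nlp('$80/wk starting 2018/06/25')]
['$80', '/wk', 'starting', '2018/06/25']

Keep in mind this replaces the default tokenizer outright, so the built-in prefix/infix rules and tokenizer exceptions are dropped; if you'd rather keep them, you can instead overwrite just nlp.tokenizer.suffix_search with suffix_re.search.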