Guidance on how to add a tokenizer rule

In my corpus, I’m seeing a lot of tokens for MONEY that look like the following: $8/hour

I’d like to split the text into ['$', '8', '/hour'], tokenize only the first two as MONEY, and train a new token class for the /hour. What is the best way to tell the spaCy tokenizer to split ‘8/hour’ when the suffix could be ‘/day’, ‘/week’, ‘/job’, etc.?


If anyone has the same question, here’s how I solved it:

The Tokenizer() constructor has an infix_finditer argument, which accepts a compiled regex’s finditer method and uses it to split tokens internally.

>>> import re
>>> import spacy
>>> from spacy.tokenizer import Tokenizer
>>> nlp = spacy.load('en_core_web_lg')
>>> infix_re = re.compile(r'/')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour')]
[$8, /, hour]

Updated to not split dates in YYYY/MM/DD format (using a suffix rule instead, so the slash is only split when followed by letters):

>>> suffix_re = spacy.util.compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [r'/[A-Za-z]+']))
>>> tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [w for w in tokenizer('$80/wk starting 2018/06/25')]
[$80, /wk, starting, 2018/06/25]

Thanks for updating!

Just a quick note on your solution though: If you create the tokenizer this way, you won’t be loading the normal language-specific rules, so your tokenization will be worse overall.

What you probably want to do is modify the English.infixes class attribute before loading the model. When the Language object is constructed, that attribute is read and compiled into the infix regex used by the tokenizer.

The English.infixes attribute is a tuple of strings, which are compiled into a regular expression. So something like this should work:

from spacy.lang.en import English

English.Defaults.infixes = English.Defaults.infixes + (r'/[A-Za-z]+',)


Thanks, this is much simpler (although it seems to be English.Defaults.infixes rather than English.infixes).

Again for posterity:

import spacy
from spacy.lang.en import English

English.Defaults.infixes = English.Defaults.infixes + (r'/(?=[A-Za-z]+)',)

nlp = spacy.load('en_core_web_lg')
