Guidance on how to add tokenizer rule

sooheon · July 1, 2018, 7:31am

In my corpus, I’m seeing a lot of tokens for MONEY which look like the following:

I’d like to split the text into ['$', '8', '/hour'], label only the first two as MONEY, and train a new class for the ‘/hour’ part. What is the best way to tell the spaCy tokenizer to split ‘8/hour’ when the suffix could be ‘/day’, ‘/week’, ‘/job’, etc.?

sooheon · July 2, 2018, 7:13am

If anyone has the same question, here’s how I solved it:

The Tokenizer() constructor takes an infix_finditer argument: a function (for example, the finditer method of a compiled regex) that finds the places inside a token where it should be split.

>>> import re
>>> from spacy.tokenizer import Tokenizer
>>> infix_re = re.compile(r'/')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour')]
[$8, /, hour]

Edit: updated to avoid splitting dates in YYYY/MM/DD format:

>>> suffix_re = spacy.util.compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [r'/[A-Za-z]+']))
>>> tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [w for w in tokenizer('$80/wk starting 2018/06/25')]
[$80, /wk, starting, 2018/06/25]
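
The reason the suffix rule leaves the date alone can be checked with the standard re module by itself. The sketch below is not spaCy's tokenizer: it only imitates the end-of-string anchoring that a suffix pattern gets, and split_unit is a hypothetical helper name used for illustration.

```python
import re

# Anchored at the end of the chunk, the way a suffix rule is applied.
# Assumption: this only imitates the "$" anchoring, not spaCy's full tokenizer.
unit_suffix = re.compile(r"/[A-Za-z]+$")

def split_unit(chunk):
    """Split a trailing '/unit' off a whitespace-delimited chunk, if present."""
    match = unit_suffix.search(chunk)
    if match is None:
        return [chunk]
    # Drop an empty head piece in case the chunk is only '/unit'.
    return [piece for piece in (chunk[:match.start()], match.group()) if piece]

text = "$80/wk starting 2018/06/25"
pieces = [piece for chunk in text.split() for piece in split_unit(chunk)]
print(pieces)  # ['$80', '/wk', 'starting', '2018/06/25']
```

The date survives because its last slash is followed by digits, so the letters-only pattern never matches at the end of that chunk.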

honnibal · July 2, 2018, 11:58am

Thanks for updating!

Just a quick note on your solution though: If you create the tokenizer this way, you won’t be loading the normal language-specific rules, so your tokenization will be worse overall.

What you probably want to do is modify the English.infixes class attribute before loading the model. The class attribute is read to build the infix regex when the tokenizer is constructed.

The English.infixes attribute is a tuple of strings, which are built into regular expressions. So something like this should work:

from spacy.lang.en import English

English.Defaults.infixes = tuple(English.Defaults.infixes) + (r'/[A-Za-z]+',)

More details here: https://spacy.io/usage/linguistic-features#section-tokenization

sooheon · July 3, 2018, 8:36am

Thanks, this is much simpler (although the attribute seems to be English.Defaults.infixes rather than English.infixes).

Again for posterity:

import spacy
from spacy.lang.en import English

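# Split on a '/' that is followed by a letter (e.g. '8/hour'),
# but not on one followed by a digit (e.g. '2018/06/25').
English.Defaults.infixes = English.Defaults.infixes + tuple([r'/(?=[A-Za-z]+)'])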

nlp = spacy.load('en_core_web_lg')

nlp.to_disk('./models/en_lg_custom')
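
To see what the lookahead in this infix pattern does, here is a small sketch using only the standard re module. It is not how spaCy applies infixes internally (the real tokenizer also handles prefixes, suffixes, and exceptions), and split_on_infix is a hypothetical helper. It also checks a detail of the character class: [A-Za-z] matches letters only, whereas the range [A-z] additionally covers the ASCII characters between 'Z' and 'a', such as '_'.

```python
import re

# Match a '/' only when a letter follows. The lookahead consumes nothing,
# so the letters stay in the next piece.
infix_re = re.compile(r"/(?=[A-Za-z])")

def split_on_infix(chunk):
    """Split chunk at every infix match, keeping the matched '/' as its own piece."""
    pieces, last = [], 0
    for match in infix_re.finditer(chunk):
        if match.start() > last:
            pieces.append(chunk[last:match.start()])
        pieces.append(match.group())
        last = match.end()
    if last < len(chunk):
        pieces.append(chunk[last:])
    return pieces

print(split_on_infix("8/hour"))      # ['8', '/', 'hour']
print(split_on_infix("2018/06/25"))  # ['2018/06/25']

# '_' sits between 'Z' and 'a' in ASCII, so [A-z] accepts it while [A-Za-z] does not.
print(bool(re.fullmatch(r"[A-z]", "_")))     # True
print(bool(re.fullmatch(r"[A-Za-z]", "_")))  # False
```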
