Guidance on how to add a tokenizer rule

If anyone has the same question, here’s how I solved it:

spaCy's Tokenizer() constructor has an infix_finditer argument, which takes a compiled regex's finditer method and splits every token internally on the matches.

>>> import re
>>> import spacy
>>> from spacy.tokenizer import Tokenizer
>>> nlp = spacy.blank('en')  # a blank English pipeline is enough for this
>>> infix_re = re.compile(r'/')
>>> tokenizer = Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
>>> [w for w in tokenizer('$8/hour')]
[$8, /, hour]

Edit: updated so it doesn't split dates in YYYY/MM/DD format:

>>> suffix_re = spacy.util.compile_suffix_regex(tuple(list(nlp.Defaults.suffixes) + [r'/[A-Za-z]+']))
>>> tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [w for w in tokenizer('$80/wk starting 2018/06/25')]
[$80, /wk, starting, 2018/06/25]
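
If you want this behaviour in a whole pipeline rather than a standalone Tokenizer, you can assign the custom tokenizer to nlp.tokenizer. A minimal sketch, assuming the same blank English pipeline and suffix_re from the example above:

>>> nlp.tokenizer = Tokenizer(nlp.vocab, suffix_search=suffix_re.search)
>>> [t.text for t in nlp('$80/wk starting 2018/06/25')]
['$80', '/wk', 'starting', '2018/06/25']

Keep in mind this replaces the default tokenizer outright, so the built-in prefix/infix rules and tokenizer exceptions are dropped; if you'd rather keep them, you can instead overwrite just nlp.tokenizer.suffix_search with suffix_re.search.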