Custom tokenization not recursive

einarbmag · June 5, 2020, 1:27pm

Hi,
I have been trying to customize the spaCy tokenizer to solve a problem in my data, where people use "-" essentially as whitespace. So I figured I'd configure "[-]+" as an infix, but the problem arises when the resulting tokens still contain a prefix or suffix. It seems the tokenizer is not recursive: it stops trying to split things once it has done an infix split. Example:

# I have abstracted this into a function following the spaCy documentation
# Raw strings keep the regex backslashes from being read as Python escapes
tokenizer = create_custom_tokenizer(nlp, custom_infixes=[r'[-]+'], custom_suffixes=[r'\)'])
doc = tokenizer("I want to split this thing: (something)--to split")
[word for word in doc]

[OUT]: [I, want, to, split, this, thing, :, (, something), --, to, split]

As you can see, the ")" is not getting split off from "something", which confuses the entity recogniser (let's say "something" is an entity). Is there any solution to this other than adding another infix rule "\)[-]+"?
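
To make the behaviour I mean concrete, here is a minimal pure-Python sketch of the order of operations as I understand it. It is not spaCy's actual code: the function name `simple_tokenize` and the three tiny regexes are made up for illustration, and the real tokenizer also handles special cases and URL matching that I leave out. It only shows that prefixes and suffixes are peeled off the ends of each whitespace-delimited chunk first, and that the pieces produced by the infix split are not checked again.

```python
import re

# Hypothetical, heavily reduced rule sets; spaCy's real defaults are much larger.
PREFIX_RE = re.compile(r"^[(\[{]")
SUFFIX_RE = re.compile(r"[)\]}:,.!?]$")
INFIX_RE = re.compile(r"[-]+")


def simple_tokenize(text):
    """Simplified model: affixes first, then one non-recursive infix split."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Step 1: peel prefixes and suffixes off the ends until neither matches.
        while chunk:
            m = PREFIX_RE.search(chunk)
            if m:
                prefixes.append(m.group())
                chunk = chunk[m.end():]
                continue
            m = SUFFIX_RE.search(chunk)
            if m:
                suffixes.append(m.group())
                chunk = chunk[:m.start()]
                continue
            break
        # Step 2: split what is left on infixes. The pieces are emitted as-is
        # and never go back through step 1.
        middle, pos = [], 0
        for m in INFIX_RE.finditer(chunk):
            if m.start() > pos:
                middle.append(chunk[pos:m.start()])
            middle.append(m.group())
            pos = m.end()
        if pos < len(chunk):
            middle.append(chunk[pos:])
        # Suffixes were collected from the outside in, so reverse them.
        tokens.extend(prefixes + middle + suffixes[::-1])
    return tokens


print(simple_tokenize("I want to split this thing: (something)--to split"))
# ['I', 'want', 'to', 'split', 'this', 'thing', ':', '(', 'something)', '--', 'to', 'split']
```

In this model the chunk "(something)--to" loses its "(" in step 1, but ")" is not at the end of the chunk at that point, so it is never seen as a suffix. After the infix split, "something)" is emitted unchanged, which is the same result I get above.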
