I have been trying to customize the spaCy tokenizer to solve a problem in my data, where people use "-" essentially as whitespace. So I figured I'd configure "[-]+" as an infix, but a problem arises when the resulting tokens contain a prefix or suffix. It seems the tokenizer is not recursive: it stops trying to split things once it has done an infix split. Example:
    # I have abstracted this into a function following the spaCy documentation
    tokenizer = create_custom_tokenizer(nlp, custom_infixes=[r'[-]+'], custom_suffixes=[r'\)'])
    doc = tokenizer("I want to split this thing: (something)--to split")
    [word for word in doc]
    [OUT]: [I, want, to, split, this, thing, :, (, something), --, to, split]
As you can see, the ")" is not getting split off from "something", which confuses the entity recogniser (let's say "something" is an entity). Is there any solution to this other than adding another infix rule "\)[-]+"?
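To make the behaviour I think I'm seeing concrete, here is a toy model written with only Python's `re` module. It is a sketch under assumptions, not spaCy's actual implementation: the name `toy_tokenize` and the three patterns are made up for illustration, and the real tokenizer also handles special cases and much larger pattern sets. It peels prefixes and suffixes off each whitespace-separated chunk first, then splits on infixes once, without re-checking the pieces.

```python
import re

# Hypothetical, heavily simplified patterns (spaCy's real defaults are larger).
PREFIX_RE = re.compile(r'^[(\[]')
SUFFIX_RE = re.compile(r'[)\]:,.!?]$')
INFIX_RE = re.compile(r'-+')


def toy_tokenize(text):
    """Toy model: strip prefixes/suffixes, then split once on infixes."""
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel prefixes and suffixes off the ends until neither matches.
        while chunk:
            m = PREFIX_RE.search(chunk)
            if m:
                prefixes.append(m.group())
                chunk = chunk[m.end():]
                continue
            m = SUFFIX_RE.search(chunk)
            if m:
                suffixes.append(m.group())
                chunk = chunk[:m.start()]
                continue
            break
        tokens.extend(prefixes)
        # Infix split: the pieces between matches are emitted as-is and are
        # never re-checked for prefixes or suffixes.
        pos = 0
        for m in INFIX_RE.finditer(chunk):
            if m.start() > pos:
                tokens.append(chunk[pos:m.start()])
            tokens.append(m.group())
            pos = m.end()
        if pos < len(chunk):
            tokens.append(chunk[pos:])
        # Suffixes were collected from the outside in, so reverse them.
        tokens.extend(reversed(suffixes))
    return tokens


print(toy_tokenize("I want to split this thing: (something)--to split"))
# ['I', 'want', 'to', 'split', 'this', 'thing', ':', '(', 'something)', '--', 'to', 'split']
```

In this model the ")" stays glued to "something" because the suffix check only ever looks at the end of the whole chunk "(something)--to", and nothing revisits the pieces after the infix split.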