I'm working on custom sentence splitters and tokenization rules. One issue I'm running into is Unicode whitespace in the input (e.g. "thin spaces", U+2009, and "em spaces", U+2003). Iterating through the document tokens with
for token in doc:, I see several tokens that contain nothing but these Unicode spaces. Standard ASCII spaces, by contrast, do not show up in the token list as distinct tokens.
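To make the observation concrete, here is a minimal sketch of how the whitespace-only tokens can be picked out. It uses plain strings standing in for each token's text. The list is made up for illustration and is not real tokenizer output, and the helper name `is_whitespace_only` is hypothetical.

```python
# Hypothetical token texts standing in for [token.text for token in doc].
# This list is illustrative only, not actual tokenizer output.
token_texts = ["Hello", "\u2009", "world", "\u2003", "!"]


def is_whitespace_only(text: str) -> bool:
    """Return True if text is non-empty and consists only of Unicode whitespace."""
    # str.isspace() covers Unicode spaces such as U+2009 and U+2003,
    # and returns False for the empty string.
    return text.isspace()


# Separate the whitespace-only entries from the tokens that carry content.
whitespace_only = [t for t in token_texts if is_whitespace_only(t)]
content_tokens = [t for t in token_texts if not is_whitespace_only(t)]

# Print the code points of each whitespace-only entry to see which
# characters are involved.
for t in whitespace_only:
    print([f"U+{ord(ch):04X}" for ch in t])
```

On real spaCy tokens, the `is_space` attribute should give the same check without going through the text, so the whitespace-only tokens could at least be skipped when deciding sentence boundaries.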
The issue is that this causes extra complexity when adding custom sentence boundaries (
So my question is: can I add these extra whitespace characters to the list of characters that an existing model's tokenizer treats as whitespace? Something like