Adding custom whitespace characters


I'm working on creating custom sentence splitters and tokenization rules. One issue I'm running into is Unicode whitespace (e.g. "thin space" U+2009 and "em space" U+2003). Iterating through document tokens with for token in doc:, I see that there are several tokens that contain only these Unicode spaces. Standard ASCII spaces, by contrast, do not appear in the token list as distinct tokens.
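A quick stdlib check (not spaCy-specific) shows why these characters surface at all: both are genuine Unicode whitespace, so str.isspace() reports True for them just as it does for a plain ASCII space.

```python
import unicodedata

# Both characters mentioned above are true Unicode whitespace, so any
# whitespace-aware tokenizer will treat them as space characters.
for cp in (0x2009, 0x2003, 0x0020):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: isspace={ch.isspace()}")
```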

The issue is that this causes extra complexity when adding custom sentence boundaries (token.is_sent_start).

So my question is: can I add these extra whitespace characters to a list of whitespace characters in an existing model for tokenization? Something like nlp.tokenizer.add_whitespace_char(...)?

This is actually expected behaviour in spaCy: whitespace is preserved, because we ensure that if you join all the tokens together with their trailing spaces, you get back the original text. We don't actually keep a copy of the text: we just keep hashes of the words, plus a boolean indicating whether each token has a trailing space.
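To illustrate that invariant, here is a toy model of the storage scheme (a sketch only, not spaCy's actual implementation): each word token carries a flag for a single trailing ASCII space, and every other whitespace run, including Unicode spaces, must become its own token so the original text can be rebuilt.

```python
import re

def toy_tokenize(text):
    """Toy tokenizer mirroring the trailing-space bookkeeping described above."""
    pieces = re.findall(r"\S+|\s+", text)  # words and whitespace runs
    tokens = []  # (token_text, has_trailing_ascii_space)
    i = 0
    while i < len(pieces):
        piece = pieces[i]
        if not piece.isspace():
            # Fold a single following ASCII space into this token's flag.
            trailing = i + 1 < len(pieces) and pieces[i + 1] == " "
            tokens.append((piece, trailing))
            if trailing:
                i += 1
        else:
            # Any other whitespace run (e.g. U+2009) is its own token.
            tokens.append((piece, False))
        i += 1
    return tokens

def detokenize(tokens):
    # Joining token texts with their trailing spaces restores the input.
    return "".join(text + (" " if ws else "") for text, ws in tokens)

text = "Hello\u2009world, twice spaced"
assert detokenize(toy_tokenize(text)) == text
```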

Could you just preprocess your text to remove the Unicode spaces?
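For example, a minimal preprocessing step could map the Unicode space variants to a plain ASCII space before calling nlp(); the character set and helper name below are illustrative, not exhaustive.

```python
# Map Unicode space variants to a plain ASCII space before tokenization.
# The set of characters here is illustrative, not exhaustive.
SPACE_MAP = {
    0x2009: " ",  # THIN SPACE
    0x2003: " ",  # EM SPACE
    0x00A0: " ",  # NO-BREAK SPACE
}

def normalize_spaces(text: str) -> str:
    return text.translate(SPACE_MAP)

print(normalize_spaces("Hello\u2009world"))  # prints "Hello world"
```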