Adding custom whitespace characters

kbarresi · January 2, 2020, 4:33pm

Hello,

I'm working on creating customized sentence splitters and tokenization rules. One issue I'm running into is Unicode whitespace (e.g. "thin spaces", U+2009, and "em spaces", U+2003). Iterating through the document tokens with for token in doc:, I see several tokens that contain nothing but these Unicode spaces. Standard ASCII spaces do not appear in the token list as distinct tokens.

The issue is that this causes extra complexity when adding custom sentence boundaries (token.is_sent_start).

So my question is: can I add these extra whitespace characters to a list of whitespace characters in an existing model for tokenization? Something like nlp.tokenizer.add_whitespace_char(...)?

honnibal · January 6, 2020, 2:28pm

This is actually expected behaviour in spaCy: whitespace is preserved, because we guarantee that if you join all the tokens together, along with any trailing spaces they own, you get back the original text. We don't actually keep a copy of the text: we just keep hashes of the words, and a boolean indicating whether each token owns a trailing space.
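To make the storage scheme described above concrete, here is a toy model in plain Python. It is only a sketch of the idea and not spaCy's actual implementation: the names toy_tokenize and detokenize are made up for illustration, and the splitting rule is an assumption. It shows why a single boolean per token can absorb an ASCII space but not a thin space or em space, which therefore has to survive as a token of its own if the text is to be recoverable.

```python
from typing import List, Tuple

# Toy representation (hypothetical): a word plus a flag saying whether it
# owns one trailing ASCII space. Nothing else about the text is stored.
ToyToken = Tuple[str, bool]


def toy_tokenize(text: str) -> List[ToyToken]:
    """Split text so that detokenize() can rebuild it exactly."""
    tokens: List[ToyToken] = []
    word = ""
    for ch in text:
        if ch == " " and word:
            # A single ASCII space after a word is folded into the flag.
            tokens.append((word, True))
            word = ""
        elif ch.isspace():
            # Any other whitespace cannot be encoded in the flag,
            # so it is kept as a separate token.
            if word:
                tokens.append((word, False))
                word = ""
            tokens.append((ch, False))
        else:
            word += ch
    if word:
        tokens.append((word, False))
    return tokens


def detokenize(tokens: List[ToyToken]) -> str:
    """Join the words and their owned trailing spaces back into text."""
    return "".join(word + (" " if owns_space else "") for word, owns_space in tokens)


text = "thin\u2009space and em\u2003space"
tokens = toy_tokenize(text)
assert detokenize(tokens) == text  # round trip is lossless
```

In spaCy itself, the corresponding round trip uses token.text_with_ws, and the trailing space a token owns is exposed as token.whitespace_.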

Could you just preprocess your text to remove the Unicode spaces?
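One possible way to do that preprocessing, as a minimal sketch using only the standard library. The function name normalize_spaces and the exact set of characters in the pattern are assumptions, so adjust them to whatever appears in your data. The sketch replaces each Unicode space with an ASCII space rather than deleting it, which is a deliberate variation on the suggestion above: the neighbouring words do not get glued together, and character offsets into the original text stay valid.

```python
import re

# Assumed set of characters to normalize: no-break space, the U+2000..U+200A
# block (includes em space U+2003 and thin space U+2009), narrow no-break
# space, medium mathematical space, and ideographic space.
UNICODE_SPACES = re.compile("[\u00a0\u2000-\u200a\u202f\u205f\u3000]")


def normalize_spaces(text: str) -> str:
    """Replace each matched Unicode space with one ASCII space."""
    return UNICODE_SPACES.sub(" ", text)


raw = "thin\u2009space and em\u2003space"
clean = normalize_spaces(raw)
assert len(clean) == len(raw)  # one-for-one replacement keeps offsets aligned
```

The cleaned string can then be passed to the pipeline in place of the raw text, e.g. doc = nlp(normalize_spaces(raw)), where nlp is whatever pipeline you have already loaded.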
