whitespaces at the beginning of a line

ines · October 4, 2021, 9:37am

Hi! I just tested this and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for \n plus the spaces that follow it. So the tokens you end up with under the hood look like this:

import spacy

nlp = spacy.blank("en")  # blank English pipeline, tokenizer only
doc = nlp("hello          dear\n         world")
print([t.text for t in doc])
# ['hello', '         ', 'dear', '\n         ', 'world']

There are a few different ways you could work around it and it kinda depends on the rest of your process:

1. Provide a list of "tokens" yourself that includes two separate tokens for \n and the spaces that follow it. You could use Prodigy's add_tokens preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token).
2. Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses doc.retokenize to split all tokens consisting of \n + spaces into two: https://spacy.io/api/doc#retokenizer.split
3. Hacky: add an invisible/zero-width space (\u200b) between the \n and the spaces. You'd probably just want to remove it again afterwards with a simple search and replace over the data.
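
For the first option, here's a rough sketch of what the postprocessing step could look like. It assumes the token dicts use the "text", "start", "end", "id" and "ws" keys that add_tokens writes to a task; the helper name split_newline_tokens and the hand-written example tokens are made up for illustration, so treat this as a starting point rather than the way to do it.

```python
def split_newline_tokens(tokens):
    """Split tokens like "\n    " into a "\n" token and a spaces-only token.

    Hypothetical helper: assumes Prodigy-style token dicts with
    "text", "start", "end", "id" and "ws" keys.
    """
    result = []
    for token in tokens:
        text = token["text"]
        rest = text.lstrip("\n")  # whatever follows the leading newline(s)
        newlines = text[: len(text) - len(rest)]
        if newlines and rest and not rest.strip(" "):
            # Token is newline(s) followed only by spaces: split it in two
            # and give each half its own character offsets.
            split_at = token["start"] + len(newlines)
            result.append({**token, "text": newlines, "end": split_at, "ws": False})
            result.append({**token, "text": rest, "start": split_at})
        else:
            result.append(dict(token))
    # Token ids have to stay consecutive after inserting new tokens.
    for i, token in enumerate(result):
        token["id"] = i
    return result


# Hand-written tokens mirroring the example above (10 spaces, then 9 spaces).
text = "hello" + " " * 10 + "dear\n" + " " * 9 + "world"
tokens = [
    {"text": "hello", "start": 0, "end": 5, "id": 0, "ws": True},
    {"text": " " * 9, "start": 6, "end": 15, "id": 1, "ws": False},
    {"text": "dear", "start": 15, "end": 19, "id": 2, "ws": False},
    {"text": "\n" + " " * 9, "start": 19, "end": 29, "id": 3, "ws": False},
    {"text": "world", "start": 29, "end": 34, "id": 4, "ws": False},
]
fixed = split_newline_tokens(tokens)
print([t["text"] for t in fixed])
# ['hello', '         ', 'dear', '\n', '         ', 'world']
```

If your tasks already contain spans with token_start and token_end, those would need shifting as well, which this sketch doesn't handle.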
