whitespaces at the beginning of a line

MeirLevinNavina · September 29, 2021, 9:36am

Hi, I am currently using ner.manual for text with multiple whitespaces and new lines. The tagging behavior for both is as expected, but we still have trouble displaying the text indented the way we want.
After adding honor_token_whitespace: false, the spacing between tokens looks much better, but at newlines it's still not as expected.

Original text:

Current result:

Expected result:

Is there any config or change that will display it as we expect?
Thanks.

ines · October 4, 2021, 9:37am

Hi! I just tested this and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for \n and trailing spaces. So the tokens you end up with under the hood look like this:

import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("hello          dear\n         world")
print([t.text for t in doc])
# ['hello', '         ', 'dear', '\n         ', 'world']

There are a few different ways you could work around it and it kinda depends on the rest of your process:

Provide a list of "tokens" yourself that includes two separate tokens for \n and the preceding spaces. You could use Prodigy's add_tokens preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token).
Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses doc.retokenize to split all tokens consisting of \n + spaces into two: https://spacy.io/api/doc#retokenizer.split
Hacky: Add an invisible/zero-width space (\u200b) in between. You'd probably just want to replace it again afterwards with a simple search and replace over the data.
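
For the first option, here is a minimal sketch of what the postprocessing step could look like. It assumes each task already has a "tokens" list of dicts with "text", "start", "end", "id" and "ws" keys (the shape add_tokens produces, as far as I know). The function name split_newline_tokens is made up for this example, and it does not remap any existing "spans" that refer to token indices, so it's meant for tasks that haven't been annotated yet.

```python
import re

# Matches a whitespace token made of one newline followed by one or more spaces.
NEWLINE_PLUS_SPACES = re.compile(r"\n +")


def split_newline_tokens(tokens):
    """Split every "\\n" + spaces token in two and renumber the token ids.

    Hypothetical helper: assumes dicts with "text", "start", "end", "id", "ws".
    """
    result = []
    for token in tokens:
        text = token["text"]
        if NEWLINE_PLUS_SPACES.fullmatch(text):
            middle = token["start"] + 1  # character offset right after the "\n"
            pieces = [
                # The newline on its own; no trailing whitespace of its own.
                {**token, "text": "\n", "end": middle, "ws": False},
                # The spaces that followed it; keeps the original "ws" flag.
                {**token, "text": text[1:], "start": middle},
            ]
        else:
            pieces = [dict(token)]
        for piece in pieces:
            piece["id"] = len(result)  # ids must stay consecutive after a split
            result.append(piece)
    return result


# Tokens for "hello          dear\n         world", written out by hand here.
tokens = [
    {"text": "hello", "start": 0, "end": 5, "id": 0, "ws": True},
    {"text": " " * 9, "start": 6, "end": 15, "id": 1, "ws": False},
    {"text": "dear", "start": 15, "end": 19, "id": 2, "ws": False},
    {"text": "\n" + " " * 9, "start": 19, "end": 29, "id": 3, "ws": False},
    {"text": "world", "start": 29, "end": 34, "id": 4, "ws": False},
]
fixed = split_newline_tokens(tokens)
print([t["text"] for t in fixed])
```

The character offsets of the two new tokens still add up to the original token's range, and every token after the split gets its id shifted by one, which is the index adjustment mentioned above.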

MeirLevinNavina · October 5, 2021, 5:15pm

Thanks @ines for the detailed response!
