whitespaces at the beginning of a line

Hi, I am currently using ner.manual for text with multiple whitespaces and new lines. The tagging behavior of both is as expected, but we still have challenges with presenting the text indented as we wish.
After adding honor_token_whitespace: false it looks much better between tokens, but at newlines it's still not as expected.

Original text:
image

Current result:
image

Expected result:
image

Is there any config or change that will display it as we expect?
Thanks.

Hi! I just tested this and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for \n and the spaces that follow it. So the tokens you end up with under the hood look like this:

import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp = spacy.blank("en")
doc = nlp("hello          dear\n         world")
print([t.text for t in doc])
# ['hello', '         ', 'dear', '\n         ', 'world']

There are a few different ways you could work around it and it kinda depends on the rest of your process:

  • Provide a list of "tokens" yourself that includes two separate tokens for \n and the spaces that follow it. You could use Prodigy's add_tokens preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token).
  • Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses doc.retokenize to split all tokens consisting of \n + spaces into two: https://spacy.io/api/doc#retokenizer.split
  • Hacky: Add an invisible/zero-width space (\u200b) in between. You'd probably just want to replace it again afterwards with a simple search and replace over the data.
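
To make the first option a bit more concrete, here is a minimal sketch of the postprocessing step in plain Python. It assumes each token is a dict with "text", "start", "end", "id" and "ws" keys, which is my understanding of what add_tokens attaches to a task. The function name split_newline_tokens is made up for this example and is not part of Prodigy.

```python
import re

# A whitespace token made of one or more newlines followed by spaces.
NEWLINE_THEN_SPACES = re.compile(r"(\n+)( +)")


def split_newline_tokens(tokens):
    """Split every "\\n + spaces" token into a newline token and a spaces token.

    Hypothetical helper: assumes token dicts with "text", "start", "end",
    "id" and "ws" keys. Ids are renumbered so they stay consecutive.
    """
    result = []
    for token in tokens:
        match = NEWLINE_THEN_SPACES.fullmatch(token["text"])
        if match is None:
            pieces = [(token["text"], token["start"], token.get("ws", False))]
        else:
            newline, spaces = match.groups()
            pieces = [
                (newline, token["start"], False),
                (spaces, token["start"] + len(newline), token.get("ws", False)),
            ]
        for text, start, ws in pieces:
            new_token = dict(token)  # keep any extra keys on the token
            new_token.update(
                text=text, start=start, end=start + len(text),
                id=len(result), ws=ws,
            )
            result.append(new_token)
    return result


text = "hello dear\n   world"
tokens = [
    {"text": "hello", "start": 0, "end": 5, "id": 0, "ws": True},
    {"text": "dear", "start": 6, "end": 10, "id": 1, "ws": False},
    {"text": "\n   ", "start": 10, "end": 14, "id": 2, "ws": False},
    {"text": "world", "start": 14, "end": 19, "id": 3, "ws": False},
]
split_tokens = split_newline_tokens(tokens)
print([t["text"] for t in split_tokens])
# ['hello', 'dear', '\n', '   ', 'world']
```

If your tasks already contain spans with token_start and token_end values, those would need to be shifted in the same pass, since every split adds one to the ids of all later tokens.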

Thanks @ines for the detailed response!