whitespaces at the beginning of a line

Hi! I just tested this and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for \n plus the spaces that follow it. So the tokens you end up with under the hood look like this:

import spacy

nlp = spacy.blank("en")
doc = nlp("hello          dear\n         world")
print([t.text for t in doc])
# ['hello', '         ', 'dear', '\n         ', 'world']

There are a few different ways you could work around it and it kinda depends on the rest of your process:

  • Provide a list of "tokens" yourself that includes two separate tokens: one for \n and one for the spaces that follow it. You could use Prodigy's add_tokens preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token).
  • Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses doc.retokenize to split all tokens consisting of \n + spaces into two: https://spacy.io/api/doc#retokenizer.split
  • Hacky: Add an invisible/zero-width space (\u200b) in between. You'd probably just want to replace it again afterwards with a simple search and replace over the data.
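
For the retokenizer option, here's a rough sketch of what such a component could look like. It assumes spaCy v3 (for the @Language.component decorator and string-based add_pipe), and the component name split_newline_whitespace is just something I made up for illustration. It splits every whitespace-only token that has spaces after its last \n into two tokens, and since a blank pipeline has no parse, both pieces are simply attached to the first subtoken as their head:

```python
import spacy
from spacy.language import Language


# "split_newline_whitespace" is a hypothetical name, not a built-in component
@Language.component("split_newline_whitespace")
def split_newline_whitespace(doc):
    with doc.retokenize() as retokenizer:
        for token in doc:
            text = token.text
            # Only handle pure-whitespace tokens with something after the last newline
            if not text.isspace() or "\n" not in text or text.endswith("\n"):
                continue
            cut = text.rfind("\n") + 1
            # The pieces must join back to exactly token.text
            orths = [text[:cut], text[cut:]]
            # No dependency parse here, so attach both pieces to the first subtoken
            heads = [(token, 0), (token, 0)]
            retokenizer.split(token, orths, heads=heads)
    return doc


nlp = spacy.blank("en")
# Run right after the tokenizer, before any other components
nlp.add_pipe("split_newline_whitespace", first=True)

text = "hello          dear\n         world"
doc = nlp(text)
print([t.text for t in doc])
```

The splits are queued and only applied when the with block exits, so it's fine to iterate over the doc while registering them. The underlying text isn't changed, only the token boundaries, so character offsets in your annotations stay valid.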