Hi! I just tested this and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for `\n` and trailing spaces. So the tokens you end up with under the hood look like this:
```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello  dear\n world")
print([t.text for t in doc])
# ['hello', ' ', 'dear', '\n ', 'world']
```
There are a few different ways you could work around it and it kinda depends on the rest of your process:
- Provide a list of `"tokens"` yourself that includes two separate tokens for the `\n` and the preceding spaces. You could use Prodigy's `add_tokens` preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token). See the first sketch below.
- Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses `doc.retokenize` to split all tokens consisting of `\n` + spaces into two: https://spacy.io/api/doc#retokenizer.split (second sketch below).
- Hacky: add an invisible/zero-width space (`\u200b`) in between. You'd probably just want to replace it again afterwards with a simple search and replace over the data (third sketch below).
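
Here's a rough sketch of the first option. It assumes Prodigy's `add_tokens` from `prodigy.components.preprocess` and the usual `"tokens"` format with `"text"`, `"start"`, `"end"` and `"id"` keys; the `split_whitespace_tokens` generator is just a name I made up for the postprocessing step:

```python
import spacy
from prodigy.components.preprocess import add_tokens

def split_whitespace_tokens(stream):
    # postprocessing step: split whitespace tokens like "\n " into "\n" + " "
    for eg in stream:
        new_tokens = []
        for token in eg["tokens"]:
            text = token["text"]
            if "\n" in text and not text.strip() and not text.endswith("\n"):
                cut = text.rindex("\n") + 1
                left, right = text[:cut], text[cut:]
                new_tokens.append({**token, "text": left,
                                   "end": token["start"] + len(left)})
                new_tokens.append({**token, "text": right,
                                   "start": token["start"] + len(left)})
            else:
                new_tokens.append(dict(token))
        # renumber the token IDs so they stay consecutive after splitting
        for i, token in enumerate(new_tokens):
            token["id"] = i
        eg["tokens"] = new_tokens
        yield eg

nlp = spacy.blank("en")
stream = [{"text": "hello  dear\n world"}]
stream = add_tokens(nlp, stream)          # add the default tokens first
stream = split_whitespace_tokens(stream)  # then adjust them
```

If your tasks already contain `"spans"` with `"token_start"`/`"token_end"`, those indices would need the same kind of adjustment after the split.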
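
And here's roughly what the retokenization component could look like. This assumes spaCy v3's `@Language.component` decorator, and the component name and split logic are just placeholders:

```python
import spacy
from spacy.language import Language

@Language.component("split_newline_whitespace")
def split_newline_whitespace(doc):
    with doc.retokenize() as retokenizer:
        for token in doc:
            text = token.text
            # whitespace-only tokens that contain a newline plus trailing spaces
            if token.is_space and "\n" in text and not text.endswith("\n"):
                cut = text.rindex("\n") + 1
                parts = [text[:cut], text[cut:]]  # e.g. "\n " -> ["\n", " "]
                retokenizer.split(token, parts, heads=[(token, 0), (token, 0)])
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("split_newline_whitespace", first=True)
doc = nlp("hello  dear\n world")
print([t.text for t in doc])
# e.g. ['hello', ' ', 'dear', '\n', ' ', 'world']
```

Because it runs first in the pipeline, everything downstream (and the tokens Prodigy receives) will already see the adjusted tokenization.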
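
Finally, for the hacky version, something along these lines (helper names made up): run the first function over your input texts before tokenization and the second one over the annotated data afterwards.

```python
ZWSP = "\u200b"  # zero-width space

def insert_zwsp(text):
    # put the invisible character between the newline and the following space,
    # so the tokenizer no longer lumps them into one token
    return text.replace("\n ", "\n" + ZWSP + " ")

def strip_zwsp(text):
    # undo it again with a simple search and replace over the data
    return text.replace(ZWSP, "")
```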