Hi! I just tested this, and I think the underlying problem here is that the default tokenizer currently produces a single whitespace token for `\n` plus the trailing spaces. So the tokens you end up with under the hood look like this:
```python
import spacy

nlp = spacy.blank("en")
doc = nlp("hello  dear\n world")
print([t.text for t in doc])
# ['hello', ' ', 'dear', '\n ', 'world']
```
There are a few different ways you could work around it and it kinda depends on the rest of your process:
- Provide a list of "tokens" yourself that includes two separate tokens for `\n` and the preceding spaces. You could use Prodigy's `add_tokens` preprocessor to add the default tokens and then adjust them automatically in a postprocessing step (just make sure to adjust the token indices when you split the existing token).
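In case it helps, here's a rough sketch of that postprocessing step. The dict keys (`"text"`, `"start"`, `"end"`, `"id"`) follow the usual token format, but adjust them to whatever your tokens actually look like – the function name and regex here are just illustrative:

```python
import re

def split_whitespace_tokens(tokens):
    """Split each token consisting of newlines + spaces into two tokens."""
    out = []
    for token in tokens:
        match = re.match(r"(\n+)( +)$", token["text"])
        if match:
            newlines, spaces = match.groups()
            split_at = token["start"] + len(newlines)
            out.append({"text": newlines, "start": token["start"], "end": split_at})
            out.append({"text": spaces, "start": split_at, "end": token["end"]})
        else:
            out.append(dict(token))
    # renumber the ids so they stay consecutive after the split
    for i, token in enumerate(out):
        token["id"] = i
    return out

tokens = [
    {"text": "hello", "start": 0, "end": 5, "id": 0},
    {"text": "dear", "start": 6, "end": 10, "id": 1},
    {"text": "\n ", "start": 10, "end": 12, "id": 2},
    {"text": "world", "start": 12, "end": 17, "id": 3},
]
print([t["text"] for t in split_whitespace_tokens(tokens)])
# ['hello', 'dear', '\n', ' ', 'world']
```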
- Probably a bit more elegant: add a rule/component that runs after the tokenizer and uses `doc.retokenize` to split all tokens consisting of `\n` + spaces into two: https://spacy.io/api/doc#retokenizer.split
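For example, something like this sketch using spaCy v3's component API (the component name and regex are mine; since there's no parse here, each subtoken just heads the first subtoken):

```python
import re
import spacy
from spacy.language import Language

@Language.component("split_trailing_space_tokens")
def split_trailing_space_tokens(doc):
    with doc.retokenize() as retokenizer:
        for token in doc:
            match = re.match(r"(\n+)( +)$", token.text)
            if match:
                newlines, spaces = match.groups()
                # split '\n ' etc. into a newline token and a space token
                retokenizer.split(token, [newlines, spaces],
                                  heads=[(token, 0), (token, 0)])
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("split_trailing_space_tokens")
doc = nlp("hello dear\n world")
print([t.text for t in doc])
```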
- Hacky: Add an invisible/zero-width space (`\u200b`) in between. You'd probably just want to replace it again afterwards with a simple search and replace over the data.
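For instance, as plain string patching (the regex is just one way to target the newline-plus-spaces runs):

```python
import re

text = "hello dear\n world"

# insert a zero-width space between the newlines and the following spaces,
# so the tokenizer no longer sees them as one whitespace run
patched = re.sub(r"(\n+)( +)", "\\1\u200b\\2", text)
print(repr(patched))  # 'hello dear\n\u200b world'

# ... process/annotate the patched text ...

# afterwards, strip the marker again with a simple replace
restored = patched.replace("\u200b", "")
assert restored == text
```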