Hi @linb, my hunch is that the BERT tokenizer (BertWordPieceTokenizer) from Hugging Face drops the newlines from the text, which is why they're not showing up in Prodigy. My suggestion is to create a custom recipe and swap that tokenizer for spaCy's default one. Keep in mind that the tokens your transformer model expects and the tokens spaCy's tokenizer produces may not line up.
The above procedure might result in destructive tokenization, which makes it harder to recover the original text from the tokenized output. If you want non-destructive tokenization, you can train a transformer-based pipeline with spaCy v3, which takes care of aligning the transformer tokenization out of the box.
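
To make the destructive vs. non-destructive distinction concrete, here is a toy sketch in plain Python. It is only an illustration: `destructive_tokenize`, `non_destructive_tokenize` and `reconstruct` are made-up helper names, and neither function is how the Hugging Face or spaCy tokenizers are actually implemented.

```python
import re


def destructive_tokenize(text):
    # str.split() with no arguments throws away all whitespace,
    # so newlines are lost and cannot be recovered from the tokens.
    return text.split()


def non_destructive_tokenize(text):
    # Hypothetical helper: keep whitespace runs as their own tokens and
    # record character offsets, so every character of the input is covered.
    return [
        {"text": m.group(), "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\S+|\s+", text)
    ]


def reconstruct(tokens):
    # Joining the token texts gives back the original string exactly.
    return "".join(token["text"] for token in tokens)


text = "First line\nSecond line"

lossy = destructive_tokenize(text)
lossless = non_destructive_tokenize(text)

print(lossy)                                  # newline is gone
print([token["text"] for token in lossless])  # newline is kept as a token
print(reconstruct(lossless) == text)
```

In the second case the newline survives as a token of its own, so the original text can be rebuilt and something is left to render in the UI. In the first case that information is gone once the text has been tokenized.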