Not sure if it’s just long tokens or only these particular character repetitions. When we highlight them, they line-wrap as expected. Prior to that, however, they look like this:
Thanks for the report, this should definitely render better! Are those long sequences one token or multiple tokens in a row?
Btw, this thread has some more background on newlines in general: why the manual mode replaces them with symbols, and the changes and new options that are coming in the next version.
Thanks!
Our custom tokenizer treats them as a single token; we don’t end up with the same problem if we split them as multiples, fwiw.
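For anyone hitting the same issue, here is a rough sketch of the splitting workaround mentioned above. It is not our actual tokenizer: `split_long_tokens` and `max_len` are hypothetical names, and it assumes tokens are dicts with "text", "start", "end" and "id" keys, so adjust it to whatever your tasks really contain.

```python
def split_long_tokens(tokens, max_len=20):
    """Split every token longer than max_len into consecutive chunks.

    Assumes each token is a dict with "text", "start", "end" and "id" keys.
    Character offsets are preserved and ids are reassigned in order.
    """
    result = []
    for token in tokens:
        text, start = token["text"], token["start"]
        if len(text) <= max_len:
            # Short tokens are copied through unchanged (apart from the id).
            result.append({"text": text, "start": start,
                           "end": token["end"], "id": len(result)})
            continue
        for offset in range(0, len(text), max_len):
            chunk = text[offset:offset + max_len]
            result.append({
                "text": chunk,
                "start": start + offset,
                "end": start + offset + len(chunk),
                "id": len(result),
            })
    return result
```

Running the token list through a helper like this before creating the tasks means no single token is longer than `max_len` characters, which is the case that wrapped fine for us.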
Ah, okay! Maybe Prodigy should just force line breaks for very long tokens. Alternatively, I’ll also experiment with a solution that adds a space after the replacement symbols.
For your use case, you’d probably also want an option to turn off added “real” line breaks when we ship the feature I describe here, so you don’t end up with 3828342 newlines, right? I might add an option for this, just in case.
Yep, that sounds great.
Just released v1.5.0, which includes the following fixes:
- newline characters within a token should now be separated by spaces, which means that long lines should wrap
- by default, newline tokens will now come with added line breaks, but you can turn this behaviour off by setting "hide_true_newline_tokens": true
- plus everything in this post
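
In case it helps, here is a minimal sketch of switching the setting on from a script. It assumes your settings live in a JSON config file such as prodigy.json; the helper name and the path you pass in are hypothetical, and only the "hide_true_newline_tokens" key comes from the release notes above.

```python
import json
from pathlib import Path


def set_hide_true_newline_tokens(config_path, value=True):
    """Set "hide_true_newline_tokens" in a JSON config file, keeping other keys."""
    path = Path(config_path)
    # Start from the existing settings if the file is already there.
    config = json.loads(path.read_text(encoding="utf8")) if path.exists() else {}
    config["hide_true_newline_tokens"] = value
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config
```

Editing the file by hand works just as well; the only thing that matters is that the key ends up in the config with the value true.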