Hi there
I am having the very same issue, with newline \n characters not rendering with line breaks. It actually looks the same as @bboris.
@bboris did you find a solution?
Prodigy version is 1.8.1
It seems like the problem in those cases is that the newlines aren't separate tokens but part of a larger token. For example, you might have two newlines in one token. One simple thing you could try is to pre-process the text and add a space between the \n\n to ensure they become separate tokens.
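To make that concrete, here's a minimal pre-processing sketch; the file names are placeholders, and the regex just inserts a space between every pair of consecutive newlines:
import re
import json

# Hypothetical file names – adjust to your own data
with open("raw_data.jsonl", encoding="utf8") as f_in, open("preprocessed_data.jsonl", "w", encoding="utf8") as f_out:
    for line in f_in:
        task = json.loads(line)
        # "\n\n" -> "\n \n" (the lookahead also handles longer runs of newlines)
        task["text"] = re.sub(r"\n(?=\n)", "\n ", task["text"])
        f_out.write(json.dumps(task) + "\n")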
Hi Ines,
I have come across the same problem mentioned in this thread: Prodigy doesn't display consecutive newlines when I have '\n\n'. A single newline works fine.
While preprocessing the text, I added a space between the newlines, so that they are now '\n \n', but Prodigy still doesn't show the new line. Could it be that this is still being converted to a single token?
Prodigy version is 1.6, using the recipe ner.manual with "hide_true_newline_tokens": False
Thanks
That's possible, yes!
I just tested it locally and the following works for me:
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# Add \n to the default infix patterns so it gets split off within a token
infixes = nlp.Defaults.infixes + (r'''\n''',)
infixes_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infixes_regex.finditer

doc = nlp("Hello\n\nworld")
print([token.text for token in doc])  # ['Hello', '\n', '\n', 'world']

# Save the model with the updated tokenizer
nlp.to_disk("/path/to/updated-model")
This adds a rule to the tokenizer that treats \n as an infix and splits it off if it occurs within a string. Modifications to the nlp object's tokenizer are serialized with it when you save it to a directory. You can then use that directory as the input model in Prodigy instead of en_core_web_sm etc., and your custom tokenization will be applied to the incoming text.
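As an optional sanity check, you can reload the saved model and confirm the custom tokenization survived serialization (the path below is the same dummy path as above):
import spacy

nlp = spacy.load("/path/to/updated-model")
doc = nlp("Hello\n\nworld")
print([token.text for token in doc])  # should still print ['Hello', '\n', '\n', 'world']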
(This is btw one of the reasons why the ner.manual recipe takes a model for tokenization: it should make it a bit easier to load in custom models with modified rules.)
Thanks! I should run this Python script once, and then the model "en_core_web_sm" will always tokenize '\n' as a single token, right? And for annotating, I just call the recipe with the same model name "en_core_web_sm"?
Yes, ideally you'd be running this script once to save out a new custom model with updated tokenization. That model will be saved to a directory (in my example, I used a dummy path /path/to/updated-model). Instead of en_core_web_sm, you'd then pass that model directory to Prodigy:
prodigy ner.manual your_dataset /path/to/updated-model your_data.jsonl --label LABEL_ONE,LABEL_TWO
Thanks Ines, I can confirm this solved the issue: a double \n is now rendered with a line break in the Prodigy interface.
However, I was getting an error on the line that extends the infixes (presumably because nlp.Defaults.infixes is a list rather than a tuple in my spaCy version), so I changed it to:
# infixes is a list here, so copy and append instead of tuple concatenation
infixes = nlp.Defaults.infixes.copy()
infixes.append(r'''\n''')