Prodigy version is 1.8.1
It seems like the problem in those cases is that the newlines aren't separate tokens but rather part of a larger token. For example, you might have two newlines in one token. I guess one simple thing you could try is to pre-process the text and add a space between the \n\n to ensure they become separate tokens.
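As a rough sketch of that pre-processing step (the separate_newlines helper below is just an illustration, not part of Prodigy or spaCy):

```python
import re

def separate_newlines(text):
    # Insert a space between every pair of consecutive newlines,
    # so "\n\n" becomes "\n \n" and each "\n" can be its own token.
    return re.sub(r"\n(?=\n)", "\n ", text)

print(repr(separate_newlines("Hello\n\nworld")))  # 'Hello\n \nworld'
```

You could apply something like this to each incoming record's "text" before loading it into Prodigy.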
I have come across the same problem mentioned in this thread: Prodigy doesn't display consecutive new lines when I have '\n\n'. For a single newline, it does work.
While preprocessing the text, I have added a space between the newlines, so that they are now '\n \n', but Prodigy still doesn't show the new line. Could it be that this is still being converted to a single token?
Prodigy version is 1.6, using the recipe ner.manual with "hide_true_newline_tokens": False
That's possible, yes!
I just tested it locally and the following works for me:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
infixes = nlp.Defaults.infixes + (r'''\n''',)
infixes_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infixes_regex.finditer

doc = nlp("Hello\n\nworld")
print([token.text for token in doc])
# ['Hello', '\n', '\n', 'world']

nlp.to_disk("/path/to/updated-model")
This adds a rule to the tokenizer that treats \n as an infix and splits it off if it occurs within a string. Modifications to the nlp object's tokenizer will be serialized with it when you save the model to a directory. You can then use that directory as the input model in Prodigy instead of en_core_web_sm etc., and your custom tokenization will be applied to the incoming text.
(This is btw one of the reasons why the ner.manual recipe takes a model for tokenization – it should make it a bit easier to load in custom models with modified rules.)
Thanks! I should run this Python script once, and then the model "en_core_web_sm" will always tokenize '\n' as a single token, right? And for annotating, I just call the recipe with the same model name "en_core_web_sm"?
Yes, ideally you'd run this script once to save out a new custom model with updated tokenization. That model will be saved to a directory – in my example, I used the dummy path /path/to/updated-model. Instead of en_core_web_sm, you'd then pass that model directory to Prodigy:
prodigy ner.manual your_dataset /path/to/updated-model your_data.jsonl --label LABEL_ONE,LABEL_TWO
Thanks Ines! I can confirm this solved the issue – a double \n is now rendered as a new line in the Prodigy interface.
However, I was getting an error appending to the infixes list, and changed the code to:

infixes = nlp.Defaults.infixes.copy()
infixes.append(r'''\n''')
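Whether the concatenation or the copy/append variant works depends on whether nlp.Defaults.infixes is a tuple or a list in your spaCy version. A small helper can cover both cases (add_newline_infix is just an illustration, not a spaCy API):

```python
def add_newline_infix(infixes):
    # Normalize to a list, append the newline rule, then return
    # the same container type we were given (tuple or list).
    rules = list(infixes)
    rules.append(r'''\n''')
    return type(infixes)(rules)

print(add_newline_infix((r'\.', r',')))  # ('\\.', ',', '\\n')
```

The result can then be passed to compile_infix_regex the same way as the plain tuple concatenation.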