Segmentation and newlines in ner.manual

Hi there

I am having the very same issue: newline \n characters are not rendered as line breaks. It looks just like what @bboris described.
@bboris did you find a solution?

Prodigy version is 1.8.1

It seems like the problem in those cases is that the newlines aren't separate tokens but rather part of a larger token. For example, you might have both newlines inside one token. One simple thing you could try is to pre-process the text and add a space between the \n\n to ensure they become separate tokens.
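That pre-processing step could be as simple as a regex substitution. A minimal sketch (the pattern and sample text here are just illustrative):

```python
import re

text = "First line.\n\nSecond line."
# Insert a space between consecutive newlines so each \n
# can later become its own token
fixed = re.sub(r"\n(?=\n)", "\n ", text)
print(repr(fixed))  # 'First line.\n \nSecond line.'
```

The lookahead `(?=\n)` matches without consuming the second newline, so runs of three or more newlines are handled as well.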

Hi Ines,

I have come across the same problem mentioned in this thread: Prodigy doesn't display consecutive newlines ('\n\n') as line breaks. A single newline does work.

While preprocessing the text, I added a space between the newlines, so that they are now '\n \n', but Prodigy still doesn't show the line break. Could it be that this is still being converted to a single token?

Prodigy version is 1.6, using the recipe ner.manual with "hide_true_newline_tokens": False

Thanks

That's possible, yes! :disappointed:

I just tested it locally and the following works for me:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# Add \n as an infix pattern so the tokenizer splits it off
# wherever it occurs inside a token
infixes = nlp.Defaults.infixes + (r'''\n''',)
infixes_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infixes_regex.finditer
doc = nlp("Hello\n\nworld")
print([token.text for token in doc])  # ['Hello', '\n', '\n', 'world']

# Save the updated pipeline so you can load it in Prodigy
nlp.to_disk("/path/to/updated-model")

This adds a rule to the tokenizer treating \n as an infix, so it will be split off if it occurs within a string. Modifications to the nlp object's tokenizer will be serialized with it when you save it to a directory. You can then use that directory as the input model in Prodigy instead of en_core_web_sm etc., and your custom tokenization will be applied to the incoming text.
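To double-check that the custom tokenization survives the save/load round trip, you can try something like the following sketch (it uses `spacy.blank("en")` and a temporary directory as stand-ins for a real model and path):

```python
import tempfile

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # stand-in for your actual model
# list() works whether Defaults.infixes is a list or a tuple
infixes = list(nlp.Defaults.infixes) + [r"\n"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)          # serializes the tokenizer rules too
    reloaded = spacy.load(model_dir)

tokens = [t.text for t in reloaded("Hello\n\nworld")]
print(tokens)
```

If the reloaded pipeline still splits the newlines into separate tokens, the directory is ready to be passed to Prodigy.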

(This is btw one of the reasons why the ner.manual recipe takes a model for tokenization – it should make it a bit easier to load in custom models with modified rules.)

Thanks! I should run this Python script once, and then the model "en_core_web_sm" will always tokenize '\n' as a single token, right? And for annotating, I just call the recipe with the same model name "en_core_web_sm"?

Yes, ideally you'd be running this script once to save out a new custom model with updated tokenization. That model will be saved to a directory – in my example, I used a dummy path /path/to/updated-model. Instead of en_core_web_sm, you'd then pass that model directory to Prodigy:

prodigy ner.manual your_dataset /path/to/updated-model your_data.jsonl --label LABEL_ONE,LABEL_TWO

Thanks Ines. I confirm this solved the issue and double \n is now rendered with a new line in the Prodigy interface.

However, I was getting an error appending to the infixes list and changed to:

infixes = nlp.Defaults.infixes.copy()
infixes.append(r'''\n''')
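For completeness, here is a version-agnostic variant of the whole snippet (with `spacy.blank("en")` as a stand-in for the actual model): `list(...)` accepts both a list and a tuple, so it sidesteps the difference between spaCy versions that caused the error above.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # stand-in for your actual model
# list() copies the defaults regardless of whether they're a list or a tuple
infixes = list(nlp.Defaults.infixes)
infixes.append(r"\n")
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

tokens = [t.text for t in nlp("Hello\n\nworld")]
print(tokens)
```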