Segmentation and newlines in ner.manual

Hi there

I am having the very same issue: newline \n characters are not rendered as line breaks. It looks just like what @bboris described.
@bboris did you find a solution?

Prodigy version is 1.8.1

It seems like the problem in those cases is that the newlines aren't separate tokens but rather part of a larger token. For example, you might have both newlines inside one token. One simple thing you could try is to pre-process the text and add a space between the \n\n to ensure they become separate tokens.
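That pre-processing step could be as simple as a regex substitution. A minimal sketch (the pattern and sample text here are just illustrative):

```python
import re

text = "First line.\n\nSecond line."
# Insert a space between consecutive newlines so each \n
# can later become its own token
fixed = re.sub(r"\n(?=\n)", "\n ", text)
print(repr(fixed))  # 'First line.\n \nSecond line.'
```

The lookahead `(?=\n)` matches without consuming the second newline, so runs of three or more newlines are handled as well.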

Hi Ines,

I have come across the same problem mentioned in this thread: Prodigy doesn't display consecutive newlines ('\n\n') as line breaks. A single newline does work.

While preprocessing the text, I added a space between the newlines, so that they are now '\n \n', but Prodigy still doesn't show the line break. Could it be that this is still being converted to a single token?

Prodigy version is 1.6, using the recipe ner.manual with "hide_true_newline_tokens": False

Thanks

That's possible, yes! :disappointed:

I just tested it locally and the following works for me:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# Add \n as an infix pattern so the tokenizer splits it off
# wherever it occurs inside a token
infixes = nlp.Defaults.infixes + (r'''\n''',)
infixes_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infixes_regex.finditer
doc = nlp("Hello\n\nworld")
print([token.text for token in doc])  # ['Hello', '\n', '\n', 'world']

# Save the updated pipeline so you can load it in Prodigy
nlp.to_disk("/path/to/updated-model")

This adds a rule to the tokenizer treating \n as an infix, so it will be split off if it occurs within a string. Modifications to the nlp object's tokenizer will be serialized with it when you save it to a directory. You can then use that directory as the input model in Prodigy instead of en_core_web_sm etc., and your custom tokenization will be applied to the incoming text.
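To double-check that the custom tokenization survives the save/load round trip, you can try something like the following sketch (it uses `spacy.blank("en")` and a temporary directory as stand-ins for a real model and path):

```python
import tempfile

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # stand-in for your actual model
# list() works whether Defaults.infixes is a list or a tuple
infixes = list(nlp.Defaults.infixes) + [r"\n"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

with tempfile.TemporaryDirectory() as model_dir:
    nlp.to_disk(model_dir)          # serializes the tokenizer rules too
    reloaded = spacy.load(model_dir)

tokens = [t.text for t in reloaded("Hello\n\nworld")]
print(tokens)
```

If the reloaded pipeline still splits the newlines into separate tokens, the directory is ready to be passed to Prodigy.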

(This is btw one of the reasons why the ner.manual recipe takes a model for tokenization – it should make it a bit easier to load in custom models with modified rules.)

Thanks! I should run this Python script once, and then the model "en_core_web_sm" will always tokenize '\n' as a single token, right? And for annotating, I just call the recipe with the same model name "en_core_web_sm"?

Yes, ideally you'd be running this script once to save out a new custom model with updated tokenization. That model will be saved to a directory – in my example, I used a dummy path /path/to/updated-model. Instead of en_core_web_sm, you'd then pass that model directory to Prodigy:

prodigy ner.manual your_dataset /path/to/updated-model your_data.jsonl --label LABEL_ONE,LABEL_TWO

Thanks Ines. I confirm this solved the issue and double \n is now rendered with a new line in the Prodigy interface.

However, I was getting an error appending to the infixes list and changed to:

infixes = nlp.Defaults.infixes.copy()
infixes.append(r'''\n''')
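For completeness, here is a version-agnostic variant of the whole snippet (with `spacy.blank("en")` as a stand-in for the actual model): `list(...)` accepts both a list and a tuple, so it sidesteps the difference between spaCy versions that caused the error above.

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")  # stand-in for your actual model
# list() copies the defaults regardless of whether they're a list or a tuple
infixes = list(nlp.Defaults.infixes)
infixes.append(r"\n")
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

tokens = [t.text for t in nlp("Hello\n\nworld")]
print(tokens)
```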