Segmentation and newlines in ner.manual

Probably you could combine the ↵ indicator and a real line break, i.e. render ↵ instead of \n (as now) and add a visual-only line break. This looks like a good trade-off (and could easily be made configurable in prodigy.json, for example).


When I create a custom recipe based on ner.manual, the output explicitly shows newline characters when rendered. For my current task it would be better if the text were just displayed normally without visible newlines, and the rendered document actually just continues on the next line. Is this a simple configuration change I’m missing? Apologies in advance if I’m overlooking the answer in an obvious place in the docs or on the forum (although I did search both, I swear!).

@KMLDS I merged your thread onto this one, because I remembered the newline discussion here – but it was really hidden away in the comments.

See my comments above for some background on why ner.manual in particular needs at least some character-based representation of newlines. However, I’ve been experimenting with the solution I suggested above, which is to add a line break after the ↵ indicator.

How relevant are the newlines or newline tokens to what you’re doing? If you don’t need them in your training data, one solution could be to add a preprocessing step that removes them from your text. (Just keep in mind that you probably want to preprocess your runtime inputs the same way, especially if you’re using spaCy, which will preserve double whitespace as individual tokens. If your model was trained on data that never included whitespace tokens, and it suddenly encounters them at runtime, this might lead to unexpected results.)
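
A minimal sketch of such a preprocessing step could look like this (the file names are placeholders, and the exact normalization is up to you):

import json
import re

def normalize_whitespace(text):
    # collapse newlines and other whitespace runs into single spaces
    return re.sub(r"\s+", " ", text).strip()

# apply the same normalization to training data and runtime inputs
with open("raw_input.jsonl") as f_in, open("preprocessed.jsonl", "w") as f_out:
    for line in f_in:
        task = json.loads(line)
        task["text"] = normalize_whitespace(task["text"])
        f_out.write(json.dumps(task) + "\n")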

Thanks @ines - I assumed that was the reason for displaying whitespace characters. In my case, I will have a couple of subject matter experts labeling documents in a format that’s familiar to them. The important thing I’m missing with the current rendering is the visual cues from paragraph breaks, bulleted lists and the like. If I either just remove the whitespace or break my training examples down into smaller chunks (e.g. just showing the text between ‘\n\n’ tokens), it will take them much longer to go through the documents we want to label.

For later modeling efforts on this task there is no semantic difference between ‘\n’ and ‘ ’, and it doesn’t really matter to me if trailing or preceding ‘\n\n’ tokens are captured (I can just remove them from the training data or model outputs; they have no importance to the task at hand).

@KMLDS Thanks for sharing your use case – that’s pretty interesting and I see your point about formatting the data as a list or adding other visual hints. Something similar also came up on this thread. I’ve shared some of my thoughts and the possible complications around this on that thread as well. I still don’t have a perfect solution in mind, but I’m sure we can come up with something that works across use cases!

Hi @ines, I’m coming across the same issue as @KMLDS - basically that the visual line breaks in the text provide useful cues during the labeling process with a ner.manual recipe. For me, it would be great to have both the ↵ and the visual line break in the web app.

Quick update: I tested the "↵ plus line break" solution and it's been working well – so we will be able to ship this update with the next release :tada:

I've also been experimenting with solutions for use cases like this one and how to allow adding more visual cues to the manual interface:

In the upcoming version, you'll be able to mark individual "tokens" in the input data as "disabled": true. This will render them in grey and prevent the user from selecting those tokens (or any text spanning across them). Disabled tokens can be used for whitespace characters, list bullets and other tokens purely intended for formatting, and they can also help the annotator identify what's important more quickly. You can also use them to prevent highlighting mistakes (e.g. by disabling all newline tokens so that entities can't span two paragraphs). The "disabled" property can also make it easier later on to separate annotator-only markup from the annotated text.
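
For example, a pre-tokenized task with a disabled newline token might look something like this (a hand-constructed sketch, not output from a real recipe; the offsets match the example text):

{"text": "Apples\nOranges", "tokens": [{"text": "Apples", "start": 0, "end": 6, "id": 0}, {"text": "\n", "start": 6, "end": 7, "id": 1, "disabled": true}, {"text": "Oranges", "start": 7, "end": 14, "id": 2}]}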


@blakey @KMLDS @menshikh-iv @bhanu

Just released v1.5.0, which includes the fixes I described above:

  • Newlines in manual mode are now rendered as ↵ plus line break.
  • To disable this behaviour (e.g. if your text contains lots of newlines like in this example), you can set "hide_true_newline_tokens": true (see the config sketch after this list).
  • You can now mark individual tokens as "disabled": true, which will render them in grey and prevent the user from selecting them. This may be a nice solution for use cases similar to the one described by @KMLDS, where the text should be enhanced with formatting markup (lists, line breaks etc.) to help the annotator.
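
For reference, a minimal prodigy.json sketch with that setting could look like this (assuming all other settings keep their defaults):

{
    "hide_true_newline_tokens": true
}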

Hi Ines, I am using Prodigy 1.5.1 and still seeing the ↵ characters without a line break for manual labeling. Is there a configuration I need to set or should this work by default?

Thank you!
Erik

@ecallen7979 Hmm, that’s strange – let me look into this! It should work without requiring additional config, but maybe something isn’t running as expected here.

Thank you!

@ecallen7979 I just tested it and I can’t seem to reproduce this :thinking: Do you have an example text?

This is definitely the expected rendering:

The "hide_true_newline_tokens" settings lets you enable hidden newlines in your config, but it should default to false.

Hi Ines,

I’ve got a similar situation. This JSON example:

{"text": "\nGSR-1-PE-5# show controller fia\n\nFabric configuration: 10Gbps bandwidth (2.4Gbps available), redundant fabric\n\nMaster Scheduler: Slot 17 Backup Scheduler: Slot 16\n\nFab epoch no 0 Halt count 0\n\nFrom Fabric FIA Errors\n\n\-----------------------\n\nredund overflow 0 cell drops 0\n\ncell parity 0\n\nSwitch cards present 0x001F Slots 16 17 18 19 20\n\nSwitch cards monitored 0x001F Slots 16 17 18 19 20\n\nSlot: 16 17 18 19 20\n\nName: csc0 csc1 sfc0 sfc1 sfc2\n\n\-------- \-------- \-------- \-------- \--------\n\nlos 0 0 0 0 0\n\nstate Off Off Off Off Off\n\ncrc16 0 0 0 0 0\n\nTo Fabric FIA Errors\n\n\-----------------------\n\nsca not pres 0 req error 0 uni fifo overflow 0\n\ngrant parity 0 multi req 0 uni fifo undrflow 0\n\ncntrl parity 0 uni req 0\n\nmulti fifo 0 empty dst req 0 handshake error 0\n\ncell parity 0\n\nGSR-1-PE-5# attach 1\n\nEntering Console for Modular SPA Interface Card in Slot: 1\n\nType "exit" to end this session\n\nPress RETURN to get started!\n\nLC-Slot1>en\n\nLC-Slot1# test fab\n\nBFLC diagnostic console program\n\nBFLC (? for help) [?]: qm_sanity_debug\n\nQM Sanity Debug enabled\n\nBFLC (? for help) [qm_sanity_debug]:\n\nSLOT 1:02:54:33: ToFAB BMA information\n\nSLOT 1:02:54:33: Number of FreeQs carved 4\n\nSLOT 1:02:54:33: Pool 1: Carve Size 102001: Current Size 102001\n\nSLOT 1:02:54:33: Pool 2: Carve Size 78462: Current Size 78462\n\nSLOT 1:02:54:33: Pool 3: Carve Size 57539: Current Size 57539\n\nSLOT 1:02:54:33: Pool 4: Carve Size 22870: Current Size 22870\n\nSLOT 1:02:54:33: IPC FreeQ: Carve Size 600: Current Size 600\n\nSLOT 1:02:54:33: Number of LOQs enabled 768\n\nSLOT 1:02:54:33: Number of LOQs disabled 1280\n\nSLOT 1:02:54:33: ToFAB BMA information\n\nSLOT 1:02:54:33: Number of FreeQs carved 4\n\nSLOT 1:02:54:33: Pool 1: Carve Size 102001: Current Size 102001\n\nSLOT 1:02:54:33: Pool 2: Carve Size 78462: Current Size 78462\n\nSLOT 1:02:54:33: Pool 3: Carve Size 57539: Current Size 57539\n\nSLOT 1:02:54:33: Pool 4: Carve Size 22870: Current Size 22870\n\nSLOT 1:02:54:33: IPC FreeQ: Carve Size 600: Current Size 600\n\nSLOT 1:02:54:33: Number of LOQs enabled 768\n\nSLOT 1:02:54:33: Number of LOQs disabled 1280\n\nQM Sanity Debug disabled\n\nBFLC (? for help) [qm_sanity_debug]: qm_sanity_info\n\nToFab QM Sanity level Warning\n\nFrFab QM Sanity level None\n\nSanity Check is triggered every 20 seconds\n\nMin. "}

Renders like this:

Running Prodigy 1.5.1 with "hide_true_newline_tokens" explicitly set to false.

Thank you

Hi there

I am having the very same issue, with newline \n characters not rendering as line breaks. It actually looks the same as @bboris’s example.
@bboris, did you find a solution?

Prodigy version is 1.8.1

It seems like the problem in those cases is that the newlines aren’t separate tokens but rather part of a token. For example, you might have two newlines in one token. One simple thing you could try is to preprocess the text and add a space between the \n\n to ensure they become separate tokens.
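
A minimal sketch of that idea (the file names are placeholders; the lookahead also handles runs of more than two newlines):

import json
import re

with open("input.jsonl") as f_in, open("preprocessed.jsonl", "w") as f_out:
    for line in f_in:
        task = json.loads(line)
        # put a space between consecutive newlines so the tokenizer
        # produces separate \n tokens instead of one \n\n token
        task["text"] = re.sub(r"\n(?=\n)", "\n ", task["text"])
        f_out.write(json.dumps(task) + "\n")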

Hi Ines,

I have come across the same problem mentioned in this thread: Prodigy doesn't display consecutive newlines when I have '\n\n'. A single newline does work.

While preprocessing the text, I added a space between the newlines, so they are now '\n \n', but Prodigy still doesn't show the line break. Could it be that this is still being converted to one single token?

Prodigy version is 1.6, using the recipe ner.manual with "hide_true_newline_tokens": False

Thanks

That's possible, yes! :disappointed:

I just tested it locally and the following works for me:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")
# add \n as an infix pattern so it gets split off within a string
infixes = nlp.Defaults.infixes + (r"\n",)
infixes_regex = compile_infix_regex(infixes)
nlp.tokenizer.infix_finditer = infixes_regex.finditer
doc = nlp("Hello\n\nworld")
print([token.text for token in doc])  # ['Hello', '\n', '\n', 'world']

# save the model, including the modified tokenizer, to a directory
nlp.to_disk("/path/to/updated-model")

This adds a rule to the tokenizer treating \n as an infix, so it will be split off if it occurs within a string. Modifications to the nlp object's tokenizer will be serialized with it when you save it to a directory. You can then use that directory as the input model in Prodigy instead of en_core_web_sm etc., and your custom tokenization will be applied to the incoming text.

(This is btw one of the reasons why the ner.manual recipe takes a model for tokenization – it should make it a bit easier to load in custom models with modified rules.)

Thanks! I should run this Python script once, and then the model "en_core_web_sm" will always tokenize '\n' as a single token, right? And for annotating, I just call the recipe with the same model name "en_core_web_sm"?

Yes, ideally you'd be running this script once to save out a new custom model with updated tokenization. That model will be saved to a directory – in my example, I used a dummy path /path/to/updated-model. Instead of en_core_web_sm, you'd then pass that model directory to Prodigy:

prodigy ner.manual your_dataset /path/to/updated-model your_data.jsonl --label LABEL_ONE,LABEL_TWO

Thanks Ines. I can confirm this solved the issue, and a double \n is now rendered with a line break in the Prodigy interface.

However, I was getting an error extending the infixes with the tuple concatenation above (in my spaCy version, nlp.Defaults.infixes is a list), so I changed it to:

infixes = nlp.Defaults.infixes.copy()
infixes.append(r"\n")