Segmentation and newlines in ner.manual

bhanu · April 17, 2018, 6:27am

I want to display the full text of a contract in one go while annotating with ner.manual.
Since Prodigy takes input from one file only, I have multiple contracts placed sequentially in that file. Is there a way to display an entire contract at once and move on to the next one after annotating it?

Basically, each contract is separated by \xa0, e.g.:

contract 1
\xa0
contract 2
\xa0
contract 3

How do I display each contract in one go and tell Prodigy to segment the input file on \xa0?

ines · April 17, 2018, 5:25pm

Hi! I hope I’m understanding your question correctly – so you want to load in texts from a “custom” format and separate them into annotation tasks according to your own logic, right?

One option would of course be to pre-process your data, read in the input file, split on \xa0 and then output a JSONL file with {"text": "contract 1"} etc per line.

You can also do this with a custom loader script in Python and then pipe its output forward to the recipe. If no source argument is set on the command line, it will default to stdin (i.e. the output of the previous process). I’m describing this in more detail on this thread.

Here’s an example:

import json

contracts = YOUR_LONG_TEXT.split('\xa0')
# you might also want to do some stripping of whitespace etc. here

for contract in contracts:
    task = {'text': contract}
    print(json.dumps(task))  # output dumped JSON

You can then pipe the tasks forward like this:

python your_script.py | prodigy ner.manual your_dataset en_core_web_sm --label SOME_LABEL
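
For reference, here is a slightly fuller sketch of such a loader script. It is only an illustration: the function name contracts_to_tasks and the inline sample text are made up for this example, and in practice you would read the raw text from your own file instead.

```python
import json


def contracts_to_tasks(raw_text, separator="\xa0"):
    """Split the raw text on the separator and yield one task per contract."""
    for chunk in raw_text.split(separator):
        text = chunk.strip()  # drop the newlines surrounding each separator
        if text:  # skip empty chunks, e.g. after a trailing separator
            yield {"text": text}


# Sample input for illustration only; replace with the contents of your file.
raw = "contract 1\n\xa0\ncontract 2\n\xa0\ncontract 3\n"

for task in contracts_to_tasks(raw):
    print(json.dumps(task))  # one JSON object per line on stdout
```

Stripping each chunk removes the newlines around the separator while leaving the line breaks inside a contract untouched.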

bhanu · April 18, 2018, 7:24am

Hey, thanks for the segmentation help. I have one more question in this regard, though. If I display the text of an entire contract in this form:

{"text": "contract 1"}

Then the contract text comes all bunched up without respecting any \n character.

<Title> ... <Introduction>...<Clause 1><Clause 2> ... <Signatories>

What if I want to keep the formatting too while showing the contract in the Prodigy UI card?

<Title>
...
<Introduction>
...
<Clause 1>
<Clause 2>
...
<Signatories>

Something along these lines

ines · April 18, 2018, 9:09pm

In order to allow faster annotation, the manual interface pre-tokenizes the text (so your selection can snap to the token boundaries). This means that single whitespace is used for splitting the words, e.g. "Hello\nworld" will become ["Hello", "world"].

Additional whitespace will be preserved, though. The manual NER interface should then replace \n with a ↵ character, to give you a visual indicator of the line break. The reason it works like that in manual mode (as opposed to just rendering a line break like in the other interfaces) is that you need a way of annotating the whitespace. Whitespace is important, because it can have an effect on the model – and the UI also needs to allow highlighting line break characters (which is very difficult if there’s no visual indicator).

Another thing to consider is that the manual interfaces (and pretty much all others) are really designed for shorter texts that you can focus on and work through quickly. So you might want to try adding more pre-processing to your contracts, and split them into paragraphs or even shorter units like sentences. This will also give you more “checkpoints” and save intermediate progress faster.
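
As a rough sketch of that extra pre-processing, the snippet below splits one contract into paragraph-level tasks on blank lines. The function name and the keys inside "meta" are hypothetical and only there so each paragraph can be traced back to its contract. Splitting on blank lines is an assumption about how your contracts are laid out.

```python
import json
import re


def paragraphs_to_tasks(contract_text, contract_id):
    """Yield one task per paragraph, treating blank lines as paragraph breaks."""
    # A newline, optional whitespace, then another newline ends a paragraph.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", contract_text)]
    paragraphs = [p for p in paragraphs if p]  # drop empty pieces
    for index, paragraph in enumerate(paragraphs):
        yield {
            "text": paragraph,
            # Hypothetical bookkeeping so a paragraph can be traced back later.
            "meta": {"contract": contract_id, "paragraph": index},
        }


sample = "Title\n\nIntro line 1\nIntro line 2\n\n\nClause 1"
for task in paragraphs_to_tasks(sample, "contract-1"):
    print(json.dumps(task))
```

Single line breaks inside a paragraph are kept, so only the paragraph boundaries turn into new tasks.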

If you feel like you need the entire context of the contract to annotate the entities, it will actually be very difficult for your model to learn anything meaningful later on. The model is able to pick up on local context very well – but if it’s difficult for you, the human, to make the annotation decision based on the local context, it will be near impossible for the model to generalise any of those decisions.

bhanu · April 19, 2018, 1:06pm

Thank you for the help.

menshikh-iv · April 21, 2018, 2:37pm

Is it possible to disable ↵ character and render \n as <br> (or any similar way)?

ines · April 22, 2018, 3:30pm

Not at the moment – but I can experiment with actually adding a line break behind the newline indicator. The reason we're using a ↵ character is that there needs to be some sort of indicator that a line break token is present – at least in manual annotation mode. Otherwise, you won't be able to select it.

I thought about this a lot and the problem with actually rendering it as \n or something is that this is way too ambiguous – because you'll have no way of knowing whether it's just a line break or the actual string "\n". This isn't that uncommon, especially if you're working with unclean data. So we figured that a similar clash on the ↵ unicode character was significantly less likely. (That said, I'm also open for other suggestions!)

menshikh-iv · April 23, 2018, 8:39am

You could probably combine ↵ and a real \n, i.e. render ↵ instead of \n (as now) and add a visual-only line break. This looks like a good trade-off (and could easily be made configurable in prodigy.json, for example).

KMLDS · May 4, 2018, 5:08pm

When I create a custom recipe based on ner.manual, the output explicitly shows newline characters when rendered. For my current task it would be better if the text were just displayed normally without visible newlines, and the rendered document actually just continues on the next line. Is this a simple configuration change I’m missing? Apologies in advance if I’m overlooking the answer in an obvious place in the docs or on the forum (although I did search both, I swear!).

ines · May 4, 2018, 5:23pm

@KMLDS I merged your thread onto this one, because I remembered the newline discussion here – but it was really hidden away in the comments.

See my comments above for some background on why ner.manual in particular needs at least some character-based representation of newlines. However, I’ve been experimenting with the solution I suggested above, which is to add a line break after the indicator.

How relevant are the newlines or newline tokens to what you’re doing? If you don’t need them in your training data, one solution could be to add a preprocessing step that removes them from your text. (Just keep in mind that you probably want to preprocess your runtime inputs the same way, especially if you’re using spaCy, which will preserve double whitespace as individual tokens. If your model was trained on data that never included whitespace tokens, and it suddenly encounters them at runtime, this might lead to unexpected results.)
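
A minimal sketch of such a preprocessing step, assuming newlines carry no meaning for your task. The helper name normalize_whitespace is made up for this example. The important part is that the same function is applied both when creating the annotation tasks and to the text you pass to the model at runtime.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (newlines included) into one space."""
    return re.sub(r"\s+", " ", text).strip()


raw = "First clause.\n\nSecond   clause,\ncontinued.\n"

# When creating annotation tasks:
task = {"text": normalize_whitespace(raw)}
print(task)

# At runtime, run the identical step before calling the model, e.g.:
# doc = nlp(normalize_whitespace(user_input))
```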

KMLDS · May 4, 2018, 6:42pm

Thanks @ines - I assumed that was the reason for displaying whitespace characters. In my case, I will have a couple of subject matter experts doing some labeling of documents with a familiar format to them. The important thing I’m missing with the current rendering is the visual cues from paragraph breaks and bulleted lists and similar. If I either just remove the whitespace or break my training examples down into smaller chunks (e.g. just showing text between ‘\n\n’ tokens), it will take them much longer to go through the documents we want to label.

For later modeling efforts on this task there is no semantic difference between ‘\n’ and ’ ', and it doesn’t really matter to me if trailing or preceding ‘\n\n’ tokens are captured (I can just remove them from training data or model outputs, they have no importance to the task at hand).

ines · May 7, 2018, 10:56am

@KMLDS Thanks for sharing your use case – that’s pretty interesting and I see your point about formatting the data as a list or adding other visual hints. Something similar also came up on this thread. I’ve shared some of my thoughts and the possible complications around this on that thread as well. I still don’t have a perfect solution in mind, but I’m sure we can come up with something that works across use cases!

blakey · May 17, 2018, 10:31pm

Hi @ines, I’m coming across the same issue as @KMLDS - basically that the visual line breaks in the text provide useful cues during the labeling process with a ner.manual recipe. For me, it would be great to have both the ↵ and the visual line break in the web app.

ines · May 22, 2018, 9:13am

Quick update: I tested the "↵ plus line break" solution and it's been working well – so we will be able to ship this update with the next release.

I've also been experimenting with solutions for use cases like this one and how to allow adding more visual clues to the manual interface:

In the upcoming version, you'll be able to mark individual "tokens" in the input data as "disabled": true. This will render them in grey and will prevent the user from selecting those tokens (or any text spanning across them). The disabled tokens can be used for whitespace characters, list bullets and other tokens purely intended for formatting, and they can also help the annotator identify what's important quicker. You can also use them to prevent highlighting mistakes (e.g. by disabling all newline tokens to not allow entities spanning over two paragraphs). The "disabled" property can also make it easier later on to separate annotator-only markup from the annotated text.
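
To illustrate the idea, here is a sketch of a task with pre-defined tokens where every newline token is marked as disabled. The token fields ("text", "start", "end", "id") reflect my understanding of the token format. The simple whitespace-based splitting is only an assumption for this example and is not the tokenization a spaCy model would produce.

```python
import json
import re

# Illustration only: a token is either a single newline or a run of
# non-whitespace characters. A real pipeline would use the model's tokenizer.
TOKEN_RE = re.compile(r"\n|\S+")


def make_task(text):
    """Build a task whose newline tokens cannot be selected."""
    tokens = []
    for i, match in enumerate(TOKEN_RE.finditer(text)):
        token = {
            "text": match.group(),
            "start": match.start(),
            "end": match.end(),
            "id": i,
        }
        if match.group() == "\n":
            token["disabled"] = True  # formatting only, not selectable
        tokens.append(token)
    return {"text": text, "tokens": tokens}


print(json.dumps(make_task("Clause 1\nClause 2")))
```

With the newline tokens disabled, a highlighted span cannot cross from one line into the next.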

ines · June 7, 2018, 5:54pm

@blakey @KMLDS @menshikh-iv @bhanu

Just released v1.5.0, which includes the fixes I described above:

Newlines in manual mode are now rendered as ↵ plus line break.
To disable this behaviour (e.g. if your text contains lots of newlines like in this example), you can set "hide_true_newline_tokens": true.
You can now mark individual tokens as "disabled": true, which will render them in grey and will prevent the user from selecting them. This may be a nice solution for use cases similar to the one described by @KMLDS, where the text should be enhanced with formatting markup (lists, line breaks etc.) to help the annotator.
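
If you prefer to set the option from a script rather than editing the file by hand, something like the following could work. It is only a sketch: the helper set_config_value is made up, and the path to your prodigy.json is an assumption you need to adjust to your setup.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, value):
    """Set one key in a JSON config file, keeping any existing settings."""
    path = Path(config_path)
    if path.exists():
        config = json.loads(path.read_text(encoding="utf8"))
    else:
        config = {}
    config[key] = value
    path.write_text(json.dumps(config, indent=2), encoding="utf8")
    return config


# Hypothetical usage; adjust the path to wherever your prodigy.json lives:
# set_config_value("prodigy.json", "hide_true_newline_tokens", True)
```

The resulting file then contains "hide_true_newline_tokens": true next to whatever settings were already there.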

ecallen7979 · October 18, 2018, 3:32pm

Hi Ines, I am using Prodigy 1.5.1 and still seeing the ↵ characters without a line break for manual labeling. Is there a configuration I need to set or should this work by default?

Thank you!
Erik

ines · October 18, 2018, 4:50pm

@ecallen7979 Hmm, that’s strange – let me look into this! It should work without requiring additional config, but maybe something isn’t running as expected here.

ecallen7979 · October 18, 2018, 5:38pm

Thank you!

ines · October 19, 2018, 4:21pm

@ecallen7979 I just tested it and I can’t seem to reproduce this. Do you have an example text?

This is definitely the expected rendering:

The "hide_true_newline_tokens" setting lets you enable hidden newlines in your config, but it should default to false.

bboris · November 7, 2018, 9:44pm

Hi Ines

I've got a similar situation.
This JSON example:

{"text": "\nGSR-1-PE-5# show controller fia\n\nFabric configuration: 10Gbps bandwidth (2.4Gbps available), redundant fabric\n\nMaster Scheduler: Slot 17 Backup Scheduler: Slot 16\n\nFab epoch no 0 Halt count 0\n\nFrom Fabric FIA Errors\n\n-----------------------\n\nredund overflow 0 cell drops 0\n\ncell parity 0\n\nSwitch cards present 0x001F Slots 16 17 18 19 20\n\nSwitch cards monitored 0x001F Slots 16 17 18 19 20\n\nSlot: 16 17 18 19 20\n\nName: csc0 csc1 sfc0 sfc1 sfc2\n\n-------- -------- -------- -------- --------\n\nlos 0 0 0 0 0\n\nstate Off Off Off Off Off\n\ncrc16 0 0 0 0 0\n\nTo Fabric FIA Errors\n\n-----------------------\n\nsca not pres 0 req error 0 uni fifo overflow 0\n\ngrant parity 0 multi req 0 uni fifo undrflow 0\n\ncntrl parity 0 uni req 0\n\nmulti fifo 0 empty dst req 0 handshake error 0\n\ncell parity 0\n\nGSR-1-PE-5# attach 1\n\nEntering Console for Modular SPA Interface Card in Slot: 1\n\nType "exit" to end this session\n\nPress RETURN to get started!\n\nLC-Slot1>en\n\nLC-Slot1# test fab\n\nBFLC diagnostic console program\n\nBFLC (? for help) [?]: qm_sanity_debug\n\nQM Sanity Debug enabled\n\nBFLC (? for help) [qm_sanity_debug]:\n\nSLOT 1:02:54:33: ToFAB BMA information\n\nSLOT 1:02:54:33: Number of FreeQs carved 4\n\nSLOT 1:02:54:33: Pool 1: Carve Size 102001: Current Size 102001\n\nSLOT 1:02:54:33: Pool 2: Carve Size 78462: Current Size 78462\n\nSLOT 1:02:54:33: Pool 3: Carve Size 57539: Current Size 57539\n\nSLOT 1:02:54:33: Pool 4: Carve Size 22870: Current Size 22870\n\nSLOT 1:02:54:33: IPC FreeQ: Carve Size 600: Current Size 600\n\nSLOT 1:02:54:33: Number of LOQs enabled 768\n\nSLOT 1:02:54:33: Number of LOQs disabled 1280\n\nSLOT 1:02:54:33: ToFAB BMA information\n\nSLOT 1:02:54:33: Number of FreeQs carved 4\n\nSLOT 1:02:54:33: Pool 1: Carve Size 102001: Current Size 102001\n\nSLOT 1:02:54:33: Pool 2: Carve Size 78462: Current Size 78462\n\nSLOT 1:02:54:33: Pool 3: Carve Size 57539: Current Size 57539\n\nSLOT 1:02:54:33: Pool 4: Carve Size 22870: Current Size 22870\n\nSLOT 1:02:54:33: IPC FreeQ: Carve Size 600: Current Size 600\n\nSLOT 1:02:54:33: Number of LOQs enabled 768\n\nSLOT 1:02:54:33: Number of LOQs disabled 1280\n\nQM Sanity Debug disabled\n\nBFLC (? for help) [qm_sanity_debug]: qm_sanity_info\n\nToFab QM Sanity level Warning\n\nFrFab QM Sanity level None\n\nSanity Check is triggered every 20 seconds\n\nMin. "}

Renders like this:

I'm running Prodigy 1.5.1 with "hide_true_newline_tokens" explicitly set to false.

Thank you
