I'd like to display an entire contract at once while annotating with ner.manual.
Since Prodigy takes input from one file only, I have multiple contracts placed sequentially in that file. Is there a way to display one whole contract at a time and, after annotating it, move on to the next?
Basically, if each contract is separated by \xa0, e.g.:
contract 1
\xa0
contract 2
\xa0
contract 3
Then how do I display each contract in one go and tell Prodigy to segment the input file on \xa0?
Hi! I hope I’m understanding your question correctly – so you want to load in texts from a “custom” format and separate them into annotation tasks according to your own logic, right?
One option would of course be to pre-process your data: read in the input file, split on \xa0 and then output a JSONL file with one task per line, i.e. {"text": "contract 1"} etc.
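As a rough sketch of that pre-processing step (the function name contracts_to_jsonl and the sample text are made up for illustration, and it assumes \xa0 only ever appears as the separator between contracts):

```python
import json

def contracts_to_jsonl(raw_text, separator="\xa0"):
    """Split the raw text on the separator and return one JSON line per contract."""
    lines = []
    for chunk in raw_text.split(separator):
        text = chunk.strip()
        if text:  # skip empty chunks, e.g. after a trailing separator
            lines.append(json.dumps({"text": text}, ensure_ascii=False))
    return lines

# Hypothetical input that mirrors the layout described in the question
raw = "contract 1\n\xa0\ncontract 2\n\xa0\ncontract 3\n"
jsonl_lines = contracts_to_jsonl(raw)
print("\n".join(jsonl_lines))
```

Writing those lines to a file such as contracts.jsonl (a hypothetical name) should give you something you can pass to the recipe as the source.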
You can also do this with a custom loader script in Python and then pipe its output forward to the recipe. If no source argument is set on the command line, it will default to stdin (i.e. the output of the previous process). I’m describing this in more detail on this thread.
Here’s an example:
import json

# YOUR_LONG_TEXT is a placeholder for the contents of your input file
contracts = YOUR_LONG_TEXT.split('\xa0')
# you might also want to do some stripping of whitespace etc. here
for contract in contracts:
    task = {'text': contract}
    print(json.dumps(task))  # output dumped JSON
In order to allow faster annotation, the manual interface pre-tokenizes the text (so your selection can snap to the token boundaries). This means that single whitespace is used for splitting the words, e.g. "Hello\nworld" will become ["Hello", "world"].
Additional whitespace will be preserved, though. The manual NER interface should then replace \n with a ↵ character, to give you a visual indicator of the line break. The reason it works like that in manual mode (as opposed to just rendering a line break like in the other interfaces) is that you need a way of annotating the whitespace. Whitespace is important, because it can have an effect on the model – and the UI also needs to allow highlighting line break characters (which is very difficult if there’s no visual indicator).
Another thing to consider is that the manual interfaces (and pretty much all others) are really designed for shorter texts that you can focus on and work through quickly. So you might want to try adding more pre-processing to your contracts, and split them into paragraphs or even shorter units like sentences. This will also give you more “checkpoints” and save intermediate progress faster.
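Here's a minimal sketch of what splitting into paragraphs could look like, assuming paragraphs are separated by one or more blank lines (split_into_paragraphs and the sample contract are invented for illustration):

```python
import re

def split_into_paragraphs(contract_text):
    """Split on one or more blank lines and drop empty pieces."""
    parts = re.split(r"\n\s*\n", contract_text)
    return [{"text": part.strip()} for part in parts if part.strip()]

# Hypothetical contract with blank lines between sections
contract = (
    "1. Definitions\nTerm A means X.\n\n"
    "2. Payment\nFees are due in 30 days.\n\n\n"
    "3. Term"
)
tasks = split_into_paragraphs(contract)
for task in tasks:
    print(repr(task["text"]))
```

Each resulting dict is one shorter annotation task, so progress gets saved after every paragraph rather than after every contract.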
If you feel like you need the entire context of the contract to annotate the entities, it will actually be very difficult for your model to learn anything meaningful later on. The model is able to pick up on local context very well – but if it’s difficult for you, the human, to make the annotation decision based on the local context, it will be near impossible for the model to generalise any of those decisions.
Not at the moment – but I can experiment with actually adding a line break behind the newline indicator. The reason we're using a ↵ character is that there needs to be some sort of indicator that a line break token is present – at least in manual annotation mode. Otherwise, you won't be able to select it.
I thought about this a lot and the problem with actually rendering it as \n or something is that this is way too ambiguous – because you'll have no way of knowing whether it's just a line break or the actual string "\n". This isn't that uncommon, especially if you're working with unclean data. So we figured that a similar clash on the ↵ unicode character was significantly less likely. (That said, I'm also open to other suggestions!)
You could probably combine ↵ and a real line break, i.e. render ↵ instead of \n (as now) and add a visual-only line break after it. This looks like a good trade-off (and could easily be made configurable in prodigy.json, for example).
When I create a custom recipe based on ner.manual, the output explicitly shows newline characters when rendered. For my current task it would be better if the text were just displayed normally without visible newlines, and the rendered document actually just continues on the next line. Is this a simple configuration change I’m missing? Apologies in advance if I’m overlooking the answer in an obvious place in the docs or on the forum (although I did search both, I swear!).
@KMLDS I merged your thread onto this one, because I remembered the newline discussion here – but it was really hidden away in the comments.
See my comments above for some background on why ner.manual in particular needs at least some character-based representation of newlines. However, I’ve been experimenting with the solution I suggested above, which is to add a line break after the indicator.
How relevant are the newlines or newline tokens to what you’re doing? If you don’t need them in your training data, one solution could be to add a preprocessing step that removes them from your text. (Just keep in mind that you probably want to preprocess your runtime inputs the same way, especially if you’re using spaCy, which will preserve double whitespace as individual tokens. If your model was trained on data that never included whitespace tokens, and it suddenly encounters them at runtime, this might lead to unexpected results.)
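A preprocessing step like that could be as small as the sketch below. The helper name normalize_whitespace is made up, and it assumes you're happy to collapse every run of whitespace (including newlines) into a single space:

```python
import re

def normalize_whitespace(text):
    """Collapse any run of whitespace, including newlines, into one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "Clause 1\n\nThe  parties agree\nto the terms.\n"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

The important part is to call the same function on the texts you annotate and on the texts you later pass to the model at runtime, so both see the same kind of input.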
Thanks @ines - I assumed that was the reason for displaying whitespace characters. In my case, I will have a couple of subject matter experts doing some labeling of documents with a familiar format to them. The important thing I’m missing with the current rendering is the visual cues from paragraph breaks and bulleted lists and similar. If I either just remove the whitespace or break my training examples down into smaller chunks (e.g. just showing text between ‘\n\n’ tokens), it will take them much longer to go through the documents we want to label.
For later modeling efforts on this task there is no semantic difference between ‘\n’ and ‘ ’, and it doesn’t really matter to me if trailing or preceding ‘\n\n’ tokens are captured (I can just remove them from training data or model outputs, they have no importance to the task at hand).
@KMLDS Thanks for sharing your use case – that’s pretty interesting and I see your point about formatting the data as a list or adding other visual hints. Something similar also came up on this thread. I’ve shared some of my thoughts and the possible complications around this on that thread as well. I still don’t have a perfect solution in mind, but I’m sure we can come up with something that works across use cases!
Hi @ines, I’m coming across the same issue as @KMLDS - basically that the visual line breaks in the text provide useful cues during the labeling process with a ner.manual recipe. For me, it would be great to have both the ↵ and the visual line break in the web app.
Quick update: I tested the "↵ plus line break" solution and it's been working well – so we will be able to ship this update with the next release.
I've also been experimenting with solutions for use cases like this one and how to allow adding more visual clues to the manual interface:
In the upcoming version, you'll be able to mark individual "tokens" in the input data as "disabled": true. This will render them in grey and will prevent the user from selecting those tokens (or any text spanning across them). The disabled tokens can be used for whitespace characters, list bullets and other tokens purely intended for formatting, and they can also help the annotator identify what's important quicker. You can also use them to prevent highlighting mistakes (e.g. by disabling all newline tokens to not allow entities spanning over two paragraphs). The "disabled" property can also make it easier later on to separate annotator-only markup from the annotated text.
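To make this a bit more concrete, here's a rough sketch of a task with pre-defined tokens where newlines and list bullets are marked as disabled. It assumes the usual token layout with "text", "start", "end" and "id", and it uses a naive regex tokenizer purely for illustration (in practice you'd probably want the tokens to come from your model's tokenizer). make_task and FORMATTING_TOKENS are made-up names:

```python
import json
import re

# Tokens treated as formatting only (an assumption for this example)
FORMATTING_TOKENS = {"\n", "•", "-", "*"}

def make_task(text):
    """Build a task dict with character offsets and disabled formatting tokens."""
    tokens = []
    # Naive tokenizer: each newline is its own token, otherwise split on whitespace
    for i, match in enumerate(re.finditer(r"\n|\S+", text)):
        token = {
            "text": match.group(),
            "start": match.start(),
            "end": match.end(),
            "id": i,
        }
        if token["text"] in FORMATTING_TOKENS:
            token["disabled"] = True  # shown in grey, not selectable
        tokens.append(token)
    return {"text": text, "tokens": tokens}

task = make_task("Obligations:\n• Pay fees\n• Deliver goods")
print(json.dumps(task, ensure_ascii=False))
```

With the newline tokens disabled, an entity can't accidentally span two list items.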
Just released v1.5.0, which includes the fixes I described above:
Newlines in manual mode are now rendered as ↵ plus line break.
To disable this behaviour (e.g. if your text contains lots of newlines like in this example), you can set "hide_true_newline_tokens": true.
You can now mark individual tokens as "disabled": true, which will render them in grey and will prevent the user from selecting them. This may be a nice solution for use cases similar to the one described by @KMLDS, where the text should be enhanced with formatting markup (lists, line breaks etc.) to help the annotator.
Hi Ines, I am using Prodigy 1.5.1 and still seeing the ↵ characters without a line break for manual labeling. Is there a configuration I need to set or should this work by default?
@ecallen7979 Hmm, that’s strange – let me look into this! It should work without requiring additional config, but maybe something isn’t running as expected here.