Right. So I did a bit of a deep dive and ended up writing a script. There are some Prodigy details worth explaining, so I'll walk through the script using an example that I have running locally. The explanation is verbose on purpose, for the sake of completeness and for anybody else who might read this post later.
Simple Example
I have an `examples.jsonl` file locally with the following contents:

```
{"text": "hi my name is vincent"}
{"text": "hi my name is jenny"}
{"text": "hi my name is brian"}
```
This dataset is meant for illustrative purposes: I'll be annotating names for NER, using the `ner.manual` recipe for that.

```
PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy ner.manual issue-6365 blank:en examples.jsonl --label person
```
Note that I'm setting `feed_overlap` via `PRODIGY_CONFIG_OVERRIDES` here to make sure that I can pass `/?session=name` in the URL.
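For example, assuming Prodigy is serving on its default port 8080 (your host and port may differ), each annotator would open their own session URL:

```
http://localhost:8080/?session=a
http://localhost:8080/?session=b
```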
Next, I've annotated some examples: one batch as annotator "a" and another as annotator "b". I can show the annotations via the `db-out` recipe, and I'm using `jq` below to show the relevant keys of the examples.

```
python -m prodigy db-out issue-6365 | jq -c "{text: .text, answer: .answer, _annotator_id: ._annotator_id}"
```
This yields the following output:

```
{"text":"hi my name is vincent","answer":"accept","_annotator_id":"issue-6365-b"}
{"text":"hi my name is vincent","answer":"accept","_annotator_id":"issue-6365-a"}
{"text":"hi my name is jenny","answer":"accept","_annotator_id":"issue-6365-a"}
{"text":"hi my name is brian","answer":"accept","_annotator_id":"issue-6365-a"}
```
In this example there is one text that was annotated by two people. The other two examples have only been annotated by annotator "a". Note that the `_annotator_id` also contains the name of the dataset, which is a Prodigy convention.
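If you ever need just the session name without the dataset prefix, you can strip it off yourself. Here's a minimal sketch (the `session_name` helper is something I'm making up for illustration; the dataset name mirrors the example above):

```python
def session_name(annotator_id: str, dataset: str) -> str:
    """Strip the '<dataset>-' prefix that Prodigy prepends to session names."""
    prefix = dataset + "-"
    if annotator_id.startswith(prefix):
        return annotator_id[len(prefix):]
    return annotator_id

print(session_name("issue-6365-a", "issue-6365"))  # -> a
```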
Towards Splitting
Next, we'd want to split this dataset. But for that we get to use some extra information that Prodigy adds to the annotated examples. Let's look at a full annotation example.
```
{
  "text":"hi my name is brian",
  "_input_hash":1445123937,
  "_task_hash":1571525982,
  "_is_binary":false,
  "tokens": [...],
  "_view_id":"ner_manual",
  "spans":[
    {
      "start":14,
      "end":19,
      "token_start":4,
      "token_end":4,
      "label":"person"
    }
  ],
  "answer":"accept",
  "_timestamp":1676388878,
  "_annotator_id":"issue-6365-a",
  "_session_id":"issue-6365-a"
}
```

Note: I've collapsed the `tokens` for the sake of brevity.
You'll notice that Prodigy adds an `_input_hash` and a `_task_hash`. These two hashes are used to deduplicate annotations. In this case the input hash is defined by the text, and the task hash is defined by the `person` label that we're annotating. I'll skip over the details of how these are created; for our purposes we merely want to re-use them to split our dataset.
I took the liberty of writing a script (called `split.py`) that does just that. Here it is:

```python
import srsly


def split(file_in, file_overlap, file_non_overlap):
    examples = list(srsly.read_jsonl(file_in))

    # First, map (input hash, task hash) -> list of annotators
    hash_session_map = {}
    for ex in examples:
        lookup = (ex["_input_hash"], ex["_task_hash"])
        if lookup not in hash_session_map:
            hash_session_map[lookup] = []
        hash_session_map[lookup].append(ex["_annotator_id"])

    # Next, use this dictionary to split the dataset
    overlap_examples = []
    non_overlap_examples = []
    for ex in examples:
        lookup = (ex["_input_hash"], ex["_task_hash"])
        if len(hash_session_map[lookup]) > 1:
            overlap_examples.append(ex)
        else:
            non_overlap_examples.append(ex)

    # Write results to disk
    srsly.write_jsonl(file_overlap, overlap_examples)
    srsly.write_jsonl(file_non_overlap, non_overlap_examples)


if __name__ == "__main__":
    split("annotations.jsonl", "overlap.jsonl", "non-overlap.jsonl")
```
This script is meant for demonstration purposes, but it should work fine for small-to-medium datasets. For very large datasets you may want to pay more attention to memory usage, since it loads every example into memory at once.
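If you want to sanity-check the grouping logic without a Prodigy install, here's a self-contained sketch of the same two-pass idea, using hypothetical in-memory examples instead of `srsly` and files (the hashes and annotator ids below are made up to mirror the output above):

```python
from collections import defaultdict

# Hypothetical annotations: one task seen by two annotators, one task by a single annotator
examples = [
    {"_input_hash": 1, "_task_hash": 10, "_annotator_id": "issue-6365-a"},
    {"_input_hash": 1, "_task_hash": 10, "_annotator_id": "issue-6365-b"},
    {"_input_hash": 2, "_task_hash": 20, "_annotator_id": "issue-6365-a"},
]

# Pass 1: collect annotators per (input hash, task hash) pair
hash_session_map = defaultdict(list)
for ex in examples:
    hash_session_map[(ex["_input_hash"], ex["_task_hash"])].append(ex["_annotator_id"])

# Pass 2: route each example based on how many annotators saw its task
overlap = [ex for ex in examples
           if len(hash_session_map[(ex["_input_hash"], ex["_task_hash"])]) > 1]
non_overlap = [ex for ex in examples
               if len(hash_session_map[(ex["_input_hash"], ex["_task_hash"])]) == 1]

print(len(overlap), len(non_overlap))  # -> 2 1
```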
With this script in place, we can first export the annotations from Prodigy. Note the `.jsonl` extension, which is what `split.py` expects.

```
python -m prodigy db-out issue-6365 > annotations.jsonl
```
Then we can run the script.

```
python split.py
```
When you run this script, you should see two files appear.

`overlap.jsonl`:

```
{"text":"hi my name is vincent","_input_hash":-796403495,"_task_hash":-1601891474,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"vincent","start":14,"end":21,"id":4,"ws":false}],"_view_id":"ner_manual","answer":"accept","_timestamp":1676388860,"_annotator_id":"issue-6365-b","_session_id":"issue-6365-b"}
{"text":"hi my name is vincent","_input_hash":-796403495,"_task_hash":-1601891474,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"vincent","start":14,"end":21,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":21,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388875,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}
```

`non-overlap.jsonl`:

```
{"text":"hi my name is jenny","_input_hash":-1772613529,"_task_hash":2124137239,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"jenny","start":14,"end":19,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":19,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388876,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}
{"text":"hi my name is brian","_input_hash":1445123937,"_task_hash":1571525982,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"brian","start":14,"end":19,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":19,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388878,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}
```
Next steps
From here, I think you'd want to feed `overlap.jsonl` to `db-in` so that you have a dataset ready for review. That way the review dataset only contains examples that at least two annotators saw, which means that you can use `--auto-accept` again.
Once you've reviewed your data, you can also upload `non-overlap.jsonl` into Prodigy via `db-in`.
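In terms of commands, that could look something like this (the dataset names `ner-overlap`, `ner-reviewed` and `ner-non-overlap` are just placeholders I made up):

```
python -m prodigy db-in ner-overlap overlap.jsonl
python -m prodigy review ner-reviewed ner-overlap --auto-accept
python -m prodigy db-in ner-non-overlap non-overlap.jsonl
```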
The nice thing about that last step is that you should be able to use `prodigy train` again. If I assume the reviewed dataset is called `ner-reviewed` and the other dataset is called `ner-non-overlap`, then you should be able to run the train command via:

```
prodigy train --ner "ner-reviewed,ner-non-overlap"
```
This will automatically pick up the examples from both datasets.
An idea
I think a script like this solves your problem for now, but I'm going to discuss this example with some colleagues. We may want to think about adding a feature that supports something like this without requiring a custom script. I'll keep you posted in this thread if we decide on anything.
Final Tip
I personally enjoy making custom scripts for things like this, and it plays very nicely with spaCy's projects feature, which is also designed to work well with Prodigy; I highly recommend checking it out if you haven't already. It makes it easy to chain a few of these commands together in such a way that scripts only run once new data is added into the mix.
Let me know if there are extra questions.