NER review datasets with partial overlap while keeping all texts

Hi,

I am looking for a way to review the NER tagging of two datasets that only partially overlap, while retaining the texts that do not overlap.

Say I have two datasets, rater_1 and rater_2, and want to create a final_set that includes rater_1 and rater_2 data, but has been reviewed.

  • rater_1 consists of texts A, B, and C
  • rater_2 consists of texts A, D

Text A has not been annotated in the same way by rater_1 and rater_2.

I want to automatically accept B, C, and D and include them in final_set without having to go over them manually in the review interface. I also want to review A and, of course, keep it in final_set.

How do I do this?

I have tried:

prodigy review final_set rater_1,rater_2 --labels ORGANIZATION -S -A

which gives me a final_set that contains only A.

My .jsonl files that have been added to the database have the following format for each line:

{"text":"Dethleffs","tokens":[{"text":"Dethleffs","start":0,"end":9,"id":0,"ws":false}],"_is_binary":false,"_view_id":"ner_manual","answer":"accept","_timestamp":1676024326,"spans":[{"start":0,"end":9,"label":"ORGANIZATION","token_start":0,"token_end":0}]}

The --auto-accept option only automatically accepts examples where at least two annotators agree. Some of the details of that behavior are explained in this thread:

I'm wondering why you'd want to automatically accept annotations if only a single annotator was involved. It could work out, but you risk accepting candidates where annotators might disagree.

Could you give a bit more background? Is there a reason why only a subset of the annotated examples should be seen by two annotators?

Thank you for your answer!

The --auto-accept option only automatically accepts examples where at least two annotators agree. Some of the details of that behavior are explained in this thread:

Aaah, that's why! It doesn't seem to correspond entirely to the documentation on the --auto-accept flag, which may have gotten me confused.

Could you give a bit more background? Is there a reason why only a subset of the annotated examples should be seen by two annotators?

Yes of course. My use-case is this:
I have a manually annotated dataset in a low-resource language, but the annotations for some of the tags (e.g. PRODUCT) are missing. To improve the dataset, I have added additional PRODUCT annotations to the same texts, taken from the predictions of an NER model trained on another language.
However, since these predictions are relatively poor, I want to review all inconsistencies between the model predictions and the manually annotated dataset, but only for the texts where the model has found a PRODUCT span.

I tried doing the review with all texts from both the model predictions and the manually annotated dataset, but then all texts have inconsistencies. I would have to go through all texts, since the model only predicts PRODUCT.

Is there a way to change the default behaviour to the one I am requesting?

I'm wondering why you'd want to automatically accept annotations if only a single annotator was involved. It could work out, but you risk accepting candidates where annotators might disagree.

To me it seems like a very useful feature, more broadly than just for my use case. Annotation projects often have ~10% of the data annotated by multiple raters to calculate interrater reliability, after which the remaining ~90% of the annotated data is accepted on the strength of high interrater reliability scores. This greatly reduces annotator cost and thus proportionally increases the number of texts annotated, with only negligible performance detriments given high interrater reliability. I have previously struggled with this exact use case.

It sounds like you may want to write a custom Python script that selects the subset you're interested in. This subset can then be loaded into Prodigy via the db-in recipe and used for a review.

I want to review all cases of inconsistencies between the model (but only for the texts where the model has found a PRODUCT span) and the manually annotated dataset.

This sounds like a pretty good idea to start with! I've gotten a lot of mileage out of similar "when models disagree we should check"-kinds of tactics.

This greatly reduces the cost of annotators and thus proportionally increases number of texts annotated, with only negligible performance detriments given high interrater reliability.

I see what you mean here. If you've already confirmed that the annotation task is well understood then it can save a whole bunch of time. Let me dive into this a bit more. I'll report back! Possibly with a script that might help.

I have previously struggled with this, for this exact use case.

I'm mainly asking out of curiosity, but could you share this anecdote?

Right. So I did a bit of a deep dive and ended up making a script. There are some Prodigy details that are worth explaining, so I'll walk through the script with an example that I've got running locally. The explanation will be verbose, just to ensure completeness for anybody else who might read this post later.

Simple Example

I have an examples.jsonl file locally with the following contents.

{"text": "hi my name is vincent"}
{"text": "hi my name is jenny"}
{"text": "hi my name is brian"}

This dataset is meant for illustrative purposes, as I will be annotating names as an NER task. I'll use ner.manual for that.

PRODIGY_CONFIG_OVERRIDES='{"feed_overlap": true}' python -m prodigy ner.manual issue-6365 blank:en examples.jsonl --label person

Note that I'm setting feed_overlap via the PRODIGY_CONFIG_OVERRIDES here to make sure that I can pass /?session=name in the URL.
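
As an aside, the same setting can also live in your prodigy.json config file instead of the environment override; the config-file equivalent of the override above is simply:

{"feed_overlap": true}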

Next, I've annotated some examples, once assuming the role of annotator "a" and once assuming annotator "b". I can show the annotations via the db-out recipe, and I'm using jq below to show the relevant keys of the examples.

python -m prodigy db-out issue-6365 | jq -c "{text: .text, answer: .answer, _annotator_id: ._annotator_id}"

This yields the following output:

{"text":"hi my name is vincent","answer":"accept","_annotator_id":"issue-6365-b"}
{"text":"hi my name is vincent","answer":"accept","_annotator_id":"issue-6365-a"}
{"text":"hi my name is jenny","answer":"accept","_annotator_id":"issue-6365-a"}
{"text":"hi my name is brian","answer":"accept","_annotator_id":"issue-6365-a"}

In this example there is one text that has been annotated by two people. The other two examples have only been annotated by annotator "a". Note that the _annotator_id also contains the name of the dataset, which is a Prodigy convention.
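
If you'd rather inspect this from Python than via jq, a rough equivalent could look like the sketch below. It uses the Database API's connect() helper and the issue-6365 dataset name from this example; depending on your Prodigy version, the method for fetching examples may differ slightly.

from prodigy.components.db import connect

db = connect()  # uses the database settings from your prodigy.json
for eg in db.get_dataset("issue-6365"):
    # print only the keys we care about for this comparison
    print({"text": eg["text"], "answer": eg["answer"], "_annotator_id": eg["_annotator_id"]})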

Towards Splitting

Next, we'd want to split this dataset. But for that we get to use some extra information that Prodigy adds to the annotated examples. Let's look at a full annotation example.

{
   "text":"hi my name is brian",
   "_input_hash":1445123937,
   "_task_hash":1571525982,
   "_is_binary":false,
   "tokens": [...],
   "_view_id":"ner_manual",
   "spans":[
      {
         "start":14,
         "end":19,
         "token_start":4,
         "token_end":4,
         "label":"person"
      }
   ],
   "answer":"accept",
   "_timestamp":1676388878,
   "_annotator_id":"issue-6365-a",
   "_session_id":"issue-6365-a"
}

Note: I've collapsed the tokens for sake of brevity.

You'll notice that Prodigy adds an _input_hash and a _task_hash. These two hashes are used to deduplicate annotations. In this case the input hash is defined by the text, and the task hash is defined by the person label that we're annotating. I'll skip over the details of how these are created; for our purposes we merely want to re-use them to split our dataset.
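
If you're curious, you can see the hashing in action via the set_hashes helper. This is just a minimal sketch; the exact fields that feed into each hash are configurable.

from prodigy import set_hashes

# set_hashes adds _input_hash and _task_hash to a task dict if they're missing
eg = set_hashes({"text": "hi my name is brian"})
print(eg["_input_hash"], eg["_task_hash"])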

I took the liberty of writing a script (called split.py) that does that. Here it is:

import srsly


def split(file_in, file_overlap, file_non_overlap):
    examples = list(srsly.read_jsonl(file_in))

    # First map (input hash, task hash) -> set of annotators who saw that task
    hash_session_map = {}
    for ex in examples:
        lookup = (ex["_input_hash"], ex["_task_hash"])
        if lookup not in hash_session_map:
            hash_session_map[lookup] = set()
        hash_session_map[lookup].add(ex["_annotator_id"])

    # Next, use this dictionary to split the dataset: examples seen by more
    # than one annotator go into the "overlap" file, the rest into "non-overlap"
    overlap_examples = []
    non_overlap_examples = []
    for ex in examples:
        lookup = (ex["_input_hash"], ex["_task_hash"])
        if len(hash_session_map[lookup]) > 1:
            overlap_examples.append(ex)
        else:
            non_overlap_examples.append(ex)

    # Write the results to disk
    srsly.write_jsonl(file_overlap, overlap_examples)
    srsly.write_jsonl(file_non_overlap, non_overlap_examples)


if __name__ == "__main__":
    split("annotations.jsonl", "overlap.jsonl", "non-overlap.jsonl")

This script is meant for demonstration purposes, but should work for small-to-medium datasets. For very large datasets you may want to consider the memory more.
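
If memory does become a concern, one option is a two-pass variant along the same lines: a first pass only collects the annotators per hash pair, and a second pass streams the examples straight to the output files. A rough sketch, with the same assumptions as split.py above:

import srsly


def split_streaming(file_in, file_overlap, file_non_overlap):
    # First pass: only collect the distinct annotators per (input, task) hash pair
    annotators = {}
    for ex in srsly.read_jsonl(file_in):
        key = (ex["_input_hash"], ex["_task_hash"])
        annotators.setdefault(key, set()).add(ex["_annotator_id"])

    # Second pass: stream the file again and route each example to the right
    # output file, without ever holding all examples in memory at once
    def select(want_overlap):
        for ex in srsly.read_jsonl(file_in):
            key = (ex["_input_hash"], ex["_task_hash"])
            if (len(annotators[key]) > 1) == want_overlap:
                yield ex

    srsly.write_jsonl(file_overlap, select(True))
    srsly.write_jsonl(file_non_overlap, select(False))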

With this script, we can first export the annotations in Prodigy.

python -m prodigy db-out issue-6365 > annotations.jsonl

Then we can run this script.

python split.py

When you run this script, you should see two files appear.

overlap.jsonl

{"text":"hi my name is vincent","_input_hash":-796403495,"_task_hash":-1601891474,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"vincent","start":14,"end":21,"id":4,"ws":false}],"_view_id":"ner_manual","answer":"accept","_timestamp":1676388860,"_annotator_id":"issue-6365-b","_session_id":"issue-6365-b"}
{"text":"hi my name is vincent","_input_hash":-796403495,"_task_hash":-1601891474,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"vincent","start":14,"end":21,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":21,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388875,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}

non-overlap.jsonl

{"text":"hi my name is jenny","_input_hash":-1772613529,"_task_hash":2124137239,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"jenny","start":14,"end":19,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":19,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388876,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}
{"text":"hi my name is brian","_input_hash":1445123937,"_task_hash":1571525982,"_is_binary":false,"tokens":[{"text":"hi","start":0,"end":2,"id":0,"ws":true},{"text":"my","start":3,"end":5,"id":1,"ws":true},{"text":"name","start":6,"end":10,"id":2,"ws":true},{"text":"is","start":11,"end":13,"id":3,"ws":true},{"text":"brian","start":14,"end":19,"id":4,"ws":false}],"_view_id":"ner_manual","spans":[{"start":14,"end":19,"token_start":4,"token_end":4,"label":"person"}],"answer":"accept","_timestamp":1676388878,"_annotator_id":"issue-6365-a","_session_id":"issue-6365-a"}

Next steps

From here, I think you'd want to feed overlap.jsonl to db-in so that you have a dataset ready for review. This way you'll only have examples that at least two annotators saw, which means that you can use --auto-accept again.

Once you've reviewed your data, you can also upload non-overlap.jsonl into Prodigy via db-in.
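
For completeness, those steps could look roughly like this; ner-overlap is just a placeholder dataset name, and ner-reviewed / ner-non-overlap match the names used below.

python -m prodigy db-in ner-overlap overlap.jsonl
python -m prodigy review ner-reviewed ner-overlap --auto-accept
python -m prodigy db-in ner-non-overlap non-overlap.jsonl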

The nice thing about that last step is that you should be able to use prodigy train again. If I assume the reviewed dataset is called ner-reviewed and the other dataset is called ner-non-overlap then you should be able to run the train command via:

prodigy train --ner "ner-reviewed,ner-non-overlap"

This will automatically pick up the examples from both datasets.

An idea

I think such a script would solve your problem for now, but I'm going to discuss this example with some colleagues. We may want to think of adding a feature that allows something like this without having to write a custom script. Will keep you posted in this thread if we decide on anything.

Final Tip

I personally enjoy making custom scripts for everything, and they play very nicely with the projects feature from spaCy. That feature is also designed to work well with Prodigy, and I highly recommend checking it out if you haven't already. It makes it easy to chain a few of these commands together in such a way that scripts only re-run once new data is added into the mix.

Let me know if there are extra questions.

Hi Vincent,

Thank you very much for your replies - I really appreciate the thoroughness. It is worth praising the way in which you and your team take the time to support us users of Prodigy.

The example you have given is easy to follow and I will try implementing it for my use case. It is also great to see some of the more sophisticated ways of using Prodigy. I have yet to try out spaCy projects, but I am planning to use it when releasing a gold-standard version of the dataset as well as the final model trained on it.

We may want to think of adding a feature that allows something like this without having to write a custom script. Will keep you posted in this thread if we decide on anything.

Could it perhaps be a parameter one might specify within the review recipe? Just a thought from someone with only a very limited idea of how/where such a feature could be useful.

Best regards,
Emil

Happy to hear it :smile:

There are a few ideas floating around at the moment, and I want to be careful not to promise anything that we might not do, but one idea is to add an extra flag. Something like:

prodigy review ... --auto-accept --accept-single

The --accept-single flag would accept every annotation where a single person annotated.

Just to gather feedback, would that work for you? One thing that's making me doubt this approach is that the feed is still given to the front-end in sequence. That means that if you have lots of examples where two people disagreed, followed by all the examples where a single person annotated, you'll first need to review all the two-person examples before the rest is automatically added. That would make the order of the examples very important.

Again, I'm curious to hear your feedback, but I want to be careful with making any promises. There's a bunch of details to consider and I'd like to get some feedback from colleagues too.

No promises needed! For my current project I already have the solution :wink: The flag-based solution would also work for my purpose, at least, and would have been tremendously helpful.

I think I understand your point on having to review large amounts of data first. But perhaps the unique examples from each annotator (unique _input_hashes) could be handled (accepted) before the actual review cases?

I am not sure about the behaviour of prodigy review ... -A, but it seems to me that the issue you're mentioning might already be present in that similar case.

Best,
Emil