Comparing new ner.manual dataset to a reviewed database

Hello,

I have a situation where 3 annotators have annotated the same batch of 50 sentences using ner.manual.

These annotations have been reviewed and validated using the [review](https://prodi.gy/docs/recipes#review) recipe and saved to a new database.

A new annotator has annotated the same sentences, and the validator would like to review the new annotations against the annotations saved in the database. However, running review raises an error:

✘ Conflicting view_id values in datasets

Can't review annotations of 'ner_manual' (in dataset 'dummy_moderated') and 'review' (in previous examples)

Why is this? Given that the annotations have been reviewed, why can't the new annotations be reviewed against this dataset?

Hi @rory-hurley-gds!

Thanks for the question!

The problem is that the dataset output by review (let's call it r1; you may have called it dummy_moderated) nests the original annotations alongside the reviewer's annotations. The original annotations have view_id == 'ner_manual', while the reviewer's annotations have view_id == 'review', and this causes the conflict. You can reproduce it by running review on r1 alone: prodigy review new_dataset r1.

One solution is a bit hacky but seemed to work for me. Keep only the reviewer's annotations from the review dataset and set their view_id to 'ner_manual', so that the reviewed annotations have the same view_id as your original annotations. Suppose you have five existing Prodigy datasets:
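To see the mix of view IDs for yourself, a small diagnostic like the one below can help. It is only a sketch: the sample record is made up, and the "versions" key used for the nested originals is an assumption about how the review output is shaped. The helper (a hypothetical name, not part of Prodigy) walks any nested structure, so it does not depend on that key being right.

```python
from collections import Counter


def collect_view_ids(node, counts=None):
    """Recursively count every `_view_id` value found in nested dicts and lists."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if "_view_id" in node:
            counts[node["_view_id"]] += 1
        for value in node.values():
            collect_view_ids(value, counts)
    elif isinstance(node, list):
        for item in node:
            collect_view_ids(item, counts)
    return counts


# Made-up stand-in for the examples in r1. The "versions" key is an assumption.
sample = [
    {
        "text": "Hello",
        "_view_id": "review",
        "versions": [
            {"text": "Hello", "_view_id": "ner_manual"},
            {"text": "Hello", "_view_id": "ner_manual"},
        ],
    }
]

view_id_counts = collect_view_ids(sample)
print(view_id_counts)
```

On a real dataset you would pass in the list returned by db.get_dataset("r1") instead of the sample. More than one key in the result means the dataset carries mixed view IDs.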

  • a1: the first annotator's 50 annotations
  • a2: the second annotator's 50 annotations
  • a3: the third annotator's 50 annotations
  • r1: the reviewer's annotations, i.e. the output dataset from prodigy review r1 a1,a2,a3
  • a4: the fourth annotator's 50 annotations

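Get the review dataset (r1) via db.get_dataset() and pass it to clumper to filter, mutate, and export the result to a .jsonl file. Alternatively, instead of exporting to .jsonl, you could write the examples straight to a new dataset through the database API (db.add_dataset() to create it, then db.add_examples() to add the examples).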

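# pip install clumper
from clumper import Clumper
from prodigy.components.db import connect

# Load every example stored in the review output dataset
db = connect()
reviews = db.get_dataset("r1")

# Keep only the reviewer's examples and relabel them as ner_manual
review_only = (Clumper(reviews)
  .keep(lambda d: d.get('_view_id') == 'review')
  .mutate(_view_id=lambda d: 'ner_manual'))

# Export the relabelled examples so they can be loaded with db-in
review_only.write_jsonl("review-only.jsonl")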
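If you would rather not install clumper, the same filter-and-relabel step can be sketched in plain Python. The helper names and sample records here are hypothetical, and save_to_dataset assumes that the database object exposes add_dataset(name) and add_examples(examples, datasets=[...]), so check that against your Prodigy version before relying on it.

```python
def reviewer_only(examples, new_view_id="ner_manual"):
    """Keep examples whose `_view_id` is 'review' and return relabelled copies."""
    return [
        {**eg, "_view_id": new_view_id}
        for eg in examples
        if eg.get("_view_id") == "review"
    ]


def save_to_dataset(db, name, examples):
    """Write examples directly to a dataset instead of exporting to .jsonl.

    Assumption: `db` comes from prodigy.components.db.connect() and provides
    add_dataset() and add_examples() with these arguments.
    """
    db.add_dataset(name)
    db.add_examples(examples, datasets=[name])


# Made-up records standing in for db.get_dataset("r1")
sample = [
    {"text": "a", "_view_id": "review", "answer": "accept"},
    {"text": "b", "_view_id": "ner_manual", "answer": "accept"},
]

relabelled = reviewer_only(sample)
print(relabelled)
```

With a real connection, save_to_dataset(db, "r2", reviewer_only(db.get_dataset("r1"))) would stand in for both the export and the db-in step. The helper builds new dicts rather than editing in place, so the loaded examples are left untouched.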

Then create a new dataset based on review-only.jsonl:

prodigy db-in r2 review-only.jsonl

And now you should be able to run:

prodigy review r3 r2,a4

Let me know if this doesn't work. I can see the challenge here and I've made a note. Perhaps in the future review could take an argument that outputs only the reviewed annotations (with the original view_id rather than review). That would solve this problem. Thank you again for your question!

Thank you very much @ryanwesslen for the detailed response.

I had anticipated this and was exploring a hacky solution, but I wanted to check that there wasn't an existing recipe for what I wanted to do.

Your instructions are very clear and well laid out. Thank you for your efforts here.

I will update you on my progress.

Thanks,
Rory
