Review recipe: auto accept identical annotations

Hi, I would like to ask for advice on the review recipe.

I'm at the beginning of a large annotation project (20,000 examples) with a team of annotators. We plan to annotate using ner.manual and to annotate each example twice, using a group_a and a group_b session ID.

After an example has been annotated by both groups, I run the review recipe.

  • Is it possible to auto-accept (and add to the review dataset) all examples that the two groups annotated identically? That would save clicking through the 90% of examples where that is the case.
  • I'll review throughout the process. Is it possible to let review only show the examples that already have 2 annotations (not just 1)?

What is the easiest way to do that?

I checked the custom recipes GitHub repo and read the review.py file, but I'm a bit lost on how I would go about modifying the recipe.

Any hints would be much appreciated.

Best regards
Paul

Hi! This is actually something I've had on my list of enhancements for the built-in workflow, because I think it'd be a nice addition.

In terms of the implementation, you can think of it like this: if you're using a manual workflow like ner.manual, Prodigy's review workflow will group together all examples with the same input hash (= the same text). The different "versions" that you see in the UI are grouped by task hash (= same annotations). In the JSON data generated by the recipe, the versions are stored as "versions" and each version has a list of "sessions" (dataset or annotation session that created this annotation).
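To make that structure concrete, here is a rough sketch of what one review task might look like as a Python dict. Only the "versions" and "sessions" keys come from the description above; the text, the spans and the session names are invented for illustration, and the exact strings Prodigy stores may differ. The describe helper is hypothetical and just counts versions and sessions.

```python
# Hypothetical review task. Only "versions" and "sessions" are taken from the
# explanation above; all values are made up for illustration.
task = {
    "text": "Apple opens a store in Berlin",
    "versions": [
        {
            # One version = one distinct set of annotations for this text
            "spans": [{"start": 0, "end": 5, "label": "ORG"}],
            # Sessions that produced exactly this version
            "sessions": ["group_a", "group_b"],
        }
    ],
}


def describe(task):
    """Return (number of distinct versions, total number of sessions)."""
    versions = task["versions"]
    n_sessions = sum(len(version["sessions"]) for version in versions)
    return len(versions), n_sessions


n_versions, n_sessions = describe(task)
print(n_versions, n_sessions)
```

One version with two sessions means both groups agreed; two versions mean a conflict; one version with one session means the example has only been annotated once so far.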

So in your case, you'd want to auto-save all examples with only one version (no conflicts) straight to the database – except for those that only have one session (only annotated by one person), which should be skipped and not annotated for now. In code, it could look like this (untested but should work):

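from prodigy.components.db import connect


def filter_review_stream(stream, dataset):
    # Connect to the database configured for Prodigy
    db = connect()
    for eg in stream:
        versions = eg["versions"]
        if len(versions) == 1:  # no conflicts, only one version
            sessions = versions[0]["sessions"]
            if len(sessions) > 1:  # same annotation from more than one session
                # Add example to dataset automatically
                eg["answer"] = "accept"
                db.add_examples([eg], [dataset])
            # Either way, don't send for annotation: the example was
            # auto-accepted or has only been annotated once so far
        else:
            yield eg  # conflicting versions, send out for annotation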

You can call that wrapper in your recipe right before it returns the components. If you just want to hack around, you can also run prodigy stats to find the location of your Prodigy installation and edit recipes/review.py directly.
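If you'd like to check the filtering logic before touching review.py, a dry run along these lines might help. It is only a sketch: FakeDB and make_task are hypothetical stand-ins, not part of Prodigy, and the filter takes the database as an extra argument so it can run on plain dictionaries without a Prodigy installation.

```python
class FakeDB:
    """Hypothetical stand-in for the Prodigy database; it only records examples."""

    def __init__(self):
        self.saved = []

    def add_examples(self, examples, datasets):
        self.saved.extend(examples)


def filter_review_stream(stream, dataset, db):
    # Same logic as the wrapper above, but with the database passed in
    for eg in stream:
        versions = eg["versions"]
        if len(versions) == 1:
            sessions = versions[0]["sessions"]
            if len(sessions) > 1:
                eg["answer"] = "accept"
                db.add_examples([eg], [dataset])
            # One version, one session: skipped for now
        else:
            yield eg


def make_task(text, session_groups):
    """Build a minimal task with one version per group of sessions."""
    return {"text": text, "versions": [{"sessions": list(s)} for s in session_groups]}


stream = [
    make_task("agreed", [["group_a", "group_b"]]),
    make_task("conflict", [["group_a"], ["group_b"]]),
    make_task("annotated once", [["group_a"]]),
]
db = FakeDB()
to_review = list(filter_review_stream(stream, "review_set", db))
print([eg["text"] for eg in to_review])
print([eg["text"] for eg in db.saved])
```

Because the wrapper is a generator, nothing is auto-accepted until the stream is actually consumed, so the saving happens gradually as examples are requested rather than all at once when the recipe starts.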

That's awesome, thank you for the explanation and the code.

I'm unsure how to use that in the recipe function. I put your function into review.py. Then in the review function, I called it:

filtered_stream = filter_review_stream(stream, dataset)

return {
    "view_id": "review",
    "dataset": dataset,
    "stream": filtered_stream,
    "before_db": before_db,
    "config": config,
}

but that causes an error: "line 240, in filter_review_stream
sessions = versions["sessions"]
TypeError: list indices must be integers or slices, not str".

Yes, that looks correct!

Ah, sorry, I think this is just a typo and should be versions[0]["sessions"], since we're looking at the sessions of the one (and only) version here.

It works! Thanks again, this saved me from going through thousands of identical annotations.
