Curation and re-annotation

Hi,

I have set up a curation recipe for our Prodigy project, but I still need to work out how to send the rejected annotations back for re-annotation BY THE SAME PERSON.

The current project setup is:

  • 6 annotators
  • 2 curators
  • 1 data set
  • session_id is the annotator ID

I thought that, instead of all annotators working off the one master data set, each annotator could have their own data set containing the rejected annotations for them to redo. Only when none are left would Prodigy serve the next text from the master data set.
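
A minimal sketch of that idea in plain Python, assuming each annotator's redo queue and the master data are simple lists of task dicts. The function name `annotator_stream` and the sample data are hypothetical and not part of Prodigy:

```python
from typing import Dict, Iterable, Iterator, List


def annotator_stream(redo_queue: List[Dict], master: Iterable[Dict]) -> Iterator[Dict]:
    """Yield the annotator's rejected tasks first, then fall back to the master data."""
    # Serve everything this annotator has to redo.
    yield from redo_queue
    # Only once the redo queue is exhausted, move on to fresh master texts.
    yield from master


# Hypothetical sample data
redo = [{"text": "fix me"}]
master = [{"text": "new 1"}, {"text": "new 2"}]
stream_texts = [eg["text"] for eg in annotator_stream(redo, master)]
```

Because the function is a generator, the master texts are only touched after the redo queue has been consumed.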

My current curation recipe is as follows (thanks for this @ines ):

import random
import re

import prodigy
from prodigy.components.db import connect


@prodigy.recipe('review_annotations',
                dataset=("Dataset name"),
                n_examples=("Number of examples to randomly review, -1 for all", "option", "n", int),
                # boolean arguments are declared as flags
                manual=("Allow manual corrections", "flag", "m", bool)
                )
def review_annotations(dataset, n_examples=-1, manual=False):
    examples = []
    db = connect()
    for dataset_id in db.sessions:  # only interested in certain user sessions
        find_user_annot = re.search(r'(-[a-zA-Z]+)$|(-[a-zA-Z]+\d+)$', dataset_id)
        if find_user_annot:
            annotations = db.get_dataset(dataset_id)
            for text in annotations:
                examples.append(text)

    if n_examples > 0:
        random.shuffle(examples)
        examples = examples[:n_examples]
    
    # collect scores here
    scores = {'right': 0, 'wrong': 0}
    
    def update(answers):
        # get all accepted / rejected examples and update scores
        rejected = [eg for eg in answers if eg['answer'] == 'reject']
        accepted = [eg for eg in answers if eg['answer'] == 'accept']
        scores['wrong'] += len(rejected)
        scores['right'] += len(accepted)

    def on_exit(ctrl):
        # called when you exit the server, compile results
        total_right = scores['right']
        total_wrong = scores['wrong']
        total = len(examples)
        print('Reviewed dataset', dataset)
        print('Loaded examples:', total)
        print('Correct:', total_right)
        print('Wrong:', total_wrong)
    
    return {
        'dataset': 'curated',
        'stream': examples,
        'view_id': 'ner_manual' if manual else 'ner',
        'update': update,
        'on_exit': on_exit
    }
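
For reference, a quick standalone check of which session names the pattern in the recipe picks up. The session names here are made up for illustration:

```python
import re

# Same pattern as in the recipe: a hyphen, letters, and optional trailing digits
SESSION_SUFFIX = re.compile(r"(-[a-zA-Z]+)$|(-[a-zA-Z]+\d+)$")

# Hypothetical session names
names = ["master-anna", "master-anna2", "2019-01-01_12-00-00", "master"]
matched = [name for name in names if SESSION_SUFFIX.search(name)]
```

Names that end in a timestamp or have no hyphenated suffix are skipped, so only the named user sessions end up in the review stream.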

Has anyone got this or something similar set up, and would you be willing to share the code or discuss it, please?

Anna

Hi! The recipe looks good – I like the stats in the on_exit hook :)

Just to make sure I understand the goal correctly: After you load the existing annotations, you want to filter out the rejected examples by one specific person, and then ask that person to re-annotate them?

In that case, I do think the easiest and cleanest solution would be to make the annotator name / session an argument of the recipe and start separate instances for the different annotators. Then you could run review_annotations your_dataset --session anna and it would get all annotations for your_dataset-anna and filter out the examples with "answer": "reject".
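
The filtering step could look roughly like this. It is only a sketch on plain dicts: the helper name `rejected_for_session` and the sample data are made up, and it assumes each saved example carries "_session_id" and "answer" keys:

```python
from typing import Dict, List


def rejected_for_session(examples: List[Dict], session_id: str) -> List[Dict]:
    """Return the examples that were rejected and belong to the given session."""
    return [
        eg for eg in examples
        if eg.get("_session_id") == session_id and eg.get("answer") == "reject"
    ]


# Hypothetical sample data
sample = [
    {"text": "a", "answer": "reject", "_session_id": "your_dataset-anna"},
    {"text": "b", "answer": "accept", "_session_id": "your_dataset-anna"},
    {"text": "c", "answer": "reject", "_session_id": "your_dataset-ben"},
]
todo = rejected_for_session(sample, "your_dataset-anna")
```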

You could even add all re-annotations to the same final master dataset with different session IDs. Recipes let you return a "get_session_id" function that was originally intended as an alternative for the time stamps (so you don't get a session ID clash if you start the server multiple times within the same second). But you should also be able to use it to just return the same string, like this:

return {
    # etc.
    "dataset": dataset,
    # user is the annotator name passed in as a recipe argument
    "get_session_id": lambda: f"{dataset}-{user}"
}

Now all annotations will be saved to dataset, with the session ID dataset-{user}. This also makes it easier to check for unannotated examples that are left and give an annotator more work when they're done. For example, something like this:

def get_stream(session_id):
    for eg in examples:
        if eg["_session_id"] == session_id and eg["answer"] == "reject":
            yield eg
    # Annotator is done, let's give them more work and check if
    # there are examples that are not yet in the master dataset
    task_hashes = db.get_task_hashes(dataset)
    for eg in examples:
        if eg["_task_hash"] not in task_hashes:
            yield eg  # nobody has annotated this one yet

Prodigy will only ask for one batch at a time, so the second loop only runs once the first one is done and annotated. Of course, there's always a small chance to end up with duplicates – for example, if user A is just annotating an example when user B checks for unannotated examples. But this should hopefully be minimal and easy to resolve.
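
One way to resolve such duplicates afterwards is to keep a single annotation per task hash. This is just a sketch on plain dicts: the helper name `drop_duplicate_tasks` is made up, and keeping the first annotation seen is an arbitrary choice you may want to replace with your own rule:

```python
from typing import Dict, Iterable, List


def drop_duplicate_tasks(examples: Iterable[Dict]) -> List[Dict]:
    """Keep only the first example seen for each task hash."""
    seen = set()
    unique = []
    for eg in examples:
        task_hash = eg["_task_hash"]
        if task_hash in seen:
            continue  # a later annotation of a task we already have
        seen.add(task_hash)
        unique.append(eg)
    return unique


# Hypothetical sample: users A and B both annotated task 1
annotated = [
    {"_task_hash": 1, "_session_id": "dataset-a", "answer": "accept"},
    {"_task_hash": 1, "_session_id": "dataset-b", "answer": "accept"},
    {"_task_hash": 2, "_session_id": "dataset-b", "answer": "reject"},
]
deduped = drop_duplicate_tasks(annotated)
```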

(In the future, I think it could be great if Prodigy was able to support streams or "stream factories" that take a second argument – the session ID – so Prodigy could pass that through. At the moment, that's a bit difficult to do in a backwards-compatible way, because streams can be any iterable. But if we can find a solution for this, it'd let you use the session ID in the stream and you'd always know who is currently requesting a new batch and can decide what to send out based on that.)


Thank you, Ines. Sorry for the late reply. I will investigate your proposed solution.

Anna
