Annotator Performance Tracker

Hi, Prodigy team :). I wonder if there is an out-of-the-box feature that tracks annotators' performance over time, e.g. by month. Basically, we want analytical tools to see the annotators' progress on their tasks so that we can decide whether additional annotators are needed.

If this is not provided, can you suggest how we can extend Prodigy to implement such analytical tools? Can we somehow save and retrieve the timestamp when an annotation is saved?

Thank you :smiley:

Hi! Timestamps for each annotation as it's submitted in the UI are a feature coming in Prodigy v1.11 – we'll be releasing a new nightly soon that already includes it :tada:

In the meantime, here are some ideas and code snippets you can use to add it yourself: Feature request: timestamps for data entry - #5 by ines. Quick note: you probably want to use Math.floor(new Date().getTime() / 1000) for the timestamp so you can easily use and convert it in Python.
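On the Python side, converting that epoch value back to a datetime is then a one-liner. A minimal sketch (the timestamp value here is made up for illustration):

from datetime import datetime, timezone

ts = 1620000000  # epoch seconds, as produced by the JavaScript snippet above
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt)  # 2021-05-03 00:00:00+00:00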

There's no direct feature that outputs the progress over time, but if you have the timestamps, you should be able to calculate that easily, depending on your task, the datasets you're tracking and how you define "progress". Using the Database API in Python, you can load all examples from one or more datasets, get their timestamps and then group them by timestamp rounded to week, month, year or whatever else you're interested in. You can then output the number of annotations created in the given time period, or a cumulative sum of all annotations, which is more like a classic "progress" diagram that's always going up. Or you could look for other patterns if you want, like the most active day of the week or the most active hour of the day :smiley:
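Something like this would give you monthly counts plus a cumulative total – a rough sketch, assuming a dataset called my_dataset whose examples carry the _timestamp key:

from collections import Counter
from datetime import datetime

from prodigy.components.db import connect

db = connect()  # connects to the database configured in prodigy.json
examples = db.get_dataset("my_dataset")  # all annotated examples in the dataset

# Bucket annotations by month, based on each example's _timestamp
counts = Counter(
    datetime.fromtimestamp(eg["_timestamp"]).strftime("%Y-%m")
    for eg in examples
    if "_timestamp" in eg
)
total = 0
for month in sorted(counts):  # "%Y-%m" strings sort chronologically
    total += counts[month]  # cumulative sum, i.e. the classic progress curve
    print(month, counts[month], total)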

Just released Prodigy v1.11, which now automatically adds a _timestamp key to all examples! We've also added a new progress command that calculates the annotation progress over time for different intervals, using the _timestamp if available (and the dataset creation time as a fallback): https://prodi.gy/docs/recipes#progress

I tried this feature, and it produced a progress table in the session log:

  Answer     Count   Annot per Hour  
 ─────────────────────────────────── 
  n_accept       3              427  
  n_reject       0                0  
  n_skip         0                0  
 ─────────────────────────────────── 
  Total          3              420 

Is there a way to output "Annot per Hour" for specific sessions using prodigy stats or prodigy progress? I tried the command below, but got the _timestamp warning shown in the output. I have a progress bar in the UI for each task, but I'd like to output the progress table per session on demand, similar to how prodigy progress and prodigy stats report progress and stats for a whole dataset.

prodigy progress -i week dataset_name

✔ Loaded 3 annotations from 1 datasets

=================================== Legend ===================================

New      New annotations collected in interval
Total    Total annotations collected   
Unique   Unique examples (not counting multiple annotations of same example)


============================ Annotation Progress ============================

                     New   Unique   Total   Unique
------------------   ---   ------   -----   ------
Sep 2022 (week 39)     3        3       3        3

⚠ No "_timestamp" found in 3 annotations from datasets:
semfaq_test_oct22. Maybe the data was created with Prodigy v1.10 or lower? Using
dataset creation time as a fallback.

Thanks!

In your case, wouldn't it be easier to run such a script on the annotated dataset instead?

prodigy db-out dataset > out.jsonl

Here's an example file I have locally with annotations from multiple annotators.

{"text":"stroopwafels are great","_input_hash":506862616,"_task_hash":-1495214589,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666777124,"_annotator_id":"issue-6044-vincent","_session_id":"issue-6044-vincent"}
{"text":"apples are healthy","_input_hash":111541500,"_task_hash":1515955516,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666777125,"_annotator_id":"issue-6044-vincent","_session_id":"issue-6044-vincent"}
{"text":"stroopwafels are great","_input_hash":506862616,"_task_hash":-1495214589,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666777134,"_annotator_id":"issue-6044-jimmy","_session_id":"issue-6044-jimmy"}
{"text":"apples are healthy","_input_hash":111541500,"_task_hash":1515955516,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666777134,"_annotator_id":"issue-6044-jimmy","_session_id":"issue-6044-jimmy"}
{"text":"stroopwafels are great","_input_hash":506862616,"_task_hash":-1495214589,"label":"truthy","_view_id":"classification","answer":"reject","_timestamp":1666777142,"_annotator_id":"issue-6044-lechuck","_session_id":"issue-6044-lechuck"}
{"text":"apples are healthy","_input_hash":111541500,"_task_hash":1515955516,"label":"truthy","_view_id":"classification","answer":"reject","_timestamp":1666777143,"_annotator_id":"issue-6044-lechuck","_session_id":"issue-6044-lechuck"}
{"text":"brussel sprouts are amazing","_input_hash":564254940,"_task_hash":-321962903,"label":"truthy","_view_id":"classification","answer":"reject","_timestamp":1666777527,"_annotator_id":"issue-6044-vincent","_session_id":"issue-6044-vincent"}
{"text":"brussel sprouts are amazing","_input_hash":564254940,"_task_hash":-321962903,"label":"truthy","_view_id":"classification","answer":"reject","_timestamp":1666777537,"_annotator_id":"issue-6044-jimmy","_session_id":"issue-6044-jimmy"}
{"text":"brussel sprouts are amazing","_input_hash":564254940,"_task_hash":-321962903,"label":"truthy","_view_id":"classification","answer":"reject","_timestamp":1666777544,"_annotator_id":"issue-6044-lechuck","_session_id":"issue-6044-lechuck"}
{"text":"it is cold today","_input_hash":718077657,"_task_hash":-363462449,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666878566,"_annotator_id":"issue-6044-guybrush","_session_id":"issue-6044-guybrush"}
{"text":"a wood chuck could chuck a lot of wood if a wood chuck could chuck wood","_input_hash":-1690856185,"_task_hash":1885086500,"label":"truthy","_view_id":"classification","answer":"accept","_timestamp":1666878830,"_annotator_id":"issue-6044-guybrush","_session_id":"issue-6044-guybrush"}

Here's a pandas script that takes such a file and groups the annotations per hour.

import pandas as pd

# Load the exported JSONL file and aggregate the annotations per hour
(pd.read_json("out.jsonl", lines=True)
    # Convert the epoch _timestamp to a datetime, rounded to the nearest hour
    .assign(dt=lambda d: pd.to_datetime(d["_timestamp"], unit="s").round("H"))
    .groupby("dt")
    # Unique texts, unique annotators, and total annotations per hour bucket
    .agg(n_text=("_input_hash", "nunique"),
         n_annot=("_annotator_id", "nunique"),
         n_examples=("_annotator_id", "size")))

Here's the output.

                     n_text  n_annot  n_examples
dt                                              
2022-10-26 10:00:00       3        3           9
2022-10-27 14:00:00       2        1           2

You can customize such a pandas query to your heart's content, but I can imagine that running something like that as a script is the most flexible option; see the per-session sketch below. Maybe even turn it into a Streamlit app or something?
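For example, to get closer to the per-session "Annot per Hour" number asked about above, here's a variation on the same idea – a sketch under the same assumptions, i.e. an out.jsonl export with _session_id and _timestamp keys:

import pandas as pd

df = pd.read_json("out.jsonl", lines=True)

# For each session: number of annotations and the time span between the
# first and last annotation, converted into an hourly rate
for session_id, group in df.groupby("_session_id"):
    n = len(group)
    hours = (group["_timestamp"].max() - group["_timestamp"].min()) / 3600
    rate = n / hours if hours > 0 else float("nan")  # nan for single-annotation sessions
    print(session_id, n, round(rate, 1))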
