Checking the progress of different annotators

ale · May 8, 2024, 11:12pm

Hi,

Is there a way to see the progress of different annotators for a specific dataset? Something like pgy progress that shows the number of accepted, rejected, and ignored annotations by each annotator when using PRODIGY_ALLOWED_SESSIONS?

Thanks

magdaaniol · May 9, 2024, 10:15am

Hi @ale
We do have a progress command, but it reports the progress of one or more datasets over time.
There isn't currently a version that breaks it down by annotator. You can display stats for a particular dataset by running prodigy stats {dataset_id}, but with session/annotator-specific datasets you'd have to run it once for each annotator's dataset.

It's a very reasonable thing to have, so for now I quickly wrapped up this feature as a custom command:

# progress.py
import os
from collections import Counter

from prodigy.components.db import Database, connect
from prodigy.core import Arg, recipe
from prodigy.errors import RecipeError
from wasabi import msg


def stats(set_id: str, DB: Database) -> None:
    stats = {}
    DB.get_dataset_by_name(set_id)
    examples = DB.get_dataset_examples(set_id)
    meta = DB.get_meta(set_id)
    n_examples = len(examples)
    decisions = Counter()
    for eg in examples:
        if "answer" in eg:
            decisions[eg["answer"]] += 1
        elif "spans" in eg:
            for span in eg["spans"]:
                if "answer" in span:
                    decisions[span["answer"]] += 1
    assert isinstance(meta, dict)
    stats["dataset_stats"] = {
        "dataset": set_id,
        "created": meta.get("created"),
        "description": meta.get("description"),
        "author": meta.get("author"),
        "annotations": n_examples,
        "accept": decisions["accept"],
        "reject": decisions["reject"],
        "ignore": decisions["ignore"],
    }

    for key, values in stats.items():
        title = key.replace("_", " ").title()
        msg.divider(title, icon="emoji")
        if isinstance(values, list):
            msg.text(", ".join(values), spaced=True)
        else:
            msg.table(
                {
                    k.replace("_", " ").title().replace("Spacy", "spaCy"): v
                    for k, v in values.items()
                }
            )


@recipe(
    "stats.progress",
    dataset=Arg(help="Name of the dataset to report progress on."),
)
def progress(
    dataset: str,
):
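    # Check the variable before splitting: os.getenv returns None when it is unset
    allowed_sessions_env = os.getenv("PRODIGY_ALLOWED_SESSIONS")
    if allowed_sessions_env is None:
        raise RecipeError(
            "Environment variable `PRODIGY_ALLOWED_SESSIONS` should be set"
        )
    allowed_sessions = set(allowed_sessions_env.split(","))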
    DB = connect()
    if dataset not in DB:
        raise RecipeError(f"Can't find '{dataset}' in database {DB.db_name}")
    session_datasets = DB.get_dataset_sessions(dataset)
    filtered_session_datasets = [
        dataset
        for dataset in session_datasets
        if dataset.split("-")[-1] in allowed_sessions
    ]
    for set_id in filtered_session_datasets:
        stats(set_id, DB)

This piggybacks on the current stats command, but I think it does what you want. You can call it like this:

PRODIGY_ALLOWED_SESSIONS="bob,alex" python -m prodigy stats.progress {name_of_the_main_dataset} -F progress.py

That should display the stats for each annotator's dataset:

=============================== Dataset Stats ===============================

Dataset       sunglasses_brands-bob
Created       2024-05-03 11:39:54
Description   None
Author        None
Annotations   6
Accept        6
Reject        0
Ignore        0


=============================== Dataset Stats ===============================

Dataset       sunglasses_brands-alex
Created       2024-05-03 11:40:08
Description   None
Author        None
Annotations   6
Accept        6
Reject        0
Ignore        0
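
One caveat on the session filter in the recipe above: dataset.split("-")[-1] keeps only the part after the last hyphen, so a session name that itself contains a hyphen would never match PRODIGY_ALLOWED_SESSIONS. Here is a minimal sketch of a prefix-based alternative, assuming session datasets are named {dataset}-{session} as in the output above. The helpers session_name and filter_session_datasets are hypothetical names for illustration, not part of Prodigy:

```python
from typing import Iterable, List, Optional, Set


def session_name(session_dataset: str, dataset: str) -> Optional[str]:
    """Return the session part of '{dataset}-{session}', or None if the prefix differs.

    Assumption: session datasets are named by joining the main dataset name
    and the session name with a hyphen.
    """
    prefix = f"{dataset}-"
    if session_dataset.startswith(prefix):
        return session_dataset[len(prefix):]
    return None


def filter_session_datasets(
    session_datasets: Iterable[str], dataset: str, allowed_sessions: Set[str]
) -> List[str]:
    """Keep only the session datasets whose session name is in allowed_sessions."""
    return [
        name
        for name in session_datasets
        if session_name(name, dataset) in allowed_sessions
    ]
```

In the recipe, the list comprehension that builds filtered_session_datasets could then be swapped for a call to filter_session_datasets(session_datasets, dataset, allowed_sessions).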

ale · June 5, 2024, 11:26pm

Thanks! It works nicely

ale · July 23, 2024, 9:42am

Hi @magdaaniol,

I tried running the recipe from above on Prodigy version 1.15.4 and it doesn't seem to be working. Do you know why?

Thanks

magdaaniol · July 23, 2024, 5:29pm

Hi @ale,

Is there any error that you're getting? Or what exactly is the unexpected behavior? I just tried running it with 1.15.4 to double check and it worked as expected.

ale · July 26, 2024, 8:50pm

Hi @magdaaniol,

I think I found out why. I exported a database on our server to a JSONL file, then imported it on my computer with pgy db-in. However, the import doesn't create the annotator sessions: I can't see them with pgy stats database -ls on my computer, but the same command lists them on the server. If I run the progress.py recipe on the server, it displays the output. On my computer, it displays nothing.

magdaaniol · July 27, 2024, 9:00am

That would be it, yeah. db-in and db-out are atomic commands that are meant to be easy to use in a bash script if imports/exports of multiple datasets are required. There can be many custom filtering configurations when it comes to import/export so we decided it's best to keep it simple and allow the user to implement the filtering logic.

For example, if you store the names of all datasets to be exported in a datasets.txt file (one name per line), you could use the following bash script to write them all to disk in my_folder:

cat datasets.txt | while read line; do python -m prodigy db-out "$line" my_folder; done

Then to import them to another DB:

cat datasets.txt | while read line; do python -m prodigy db-in "$line" my_folder/"$line".jsonl ; done
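
If you only have the exported JSONL of the main dataset and the session datasets are missing, a per-annotator breakdown can still be approximated from the file itself. This is a rough sketch, assuming each exported example carries an "_annotator_id" field alongside "answer"; check a few lines of your export first, since that is an assumption about your data. The function progress_by_annotator is a hypothetical helper, not a Prodigy API:

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def progress_by_annotator(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count answers per annotator from the lines of an exported JSONL file.

    Assumption: each example has an "_annotator_id" key. Examples without it
    are grouped under "unknown", and examples without "answer" are counted
    as "missing".
    """
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        annotator = eg.get("_annotator_id", "unknown")
        counts[annotator][eg.get("answer", "missing")] += 1
    return dict(counts)


# Example usage (the file path is illustrative):
# with open("my_folder/sunglasses_brands.jsonl", encoding="utf8") as f:
#     for annotator, decisions in progress_by_annotator(f).items():
#         print(annotator, dict(decisions))
```

This only reads the file, so it doesn't depend on the session datasets existing in the local database.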
