Checking the progress of different annotators

Hi,

Is there a way to see the progress of different annotators for a specific dataset? Something like pgy progress that shows the number of accepted, rejected, and ignored annotations by each annotator when using PRODIGY_ALLOWED_SESSIONS?

Thanks

Hi @ale,
We do have a progress command, but it reports the progress of one or more datasets over time.
There isn't currently a version that breaks it down by annotator. You can display stats for a particular dataset by running prodigy stats {dataset_id}, but for session/annotator-specific datasets you'd have to run this once for each annotator's dataset.

It's a very reasonable thing to have so for now I quickly wrapped up this feature as a custom command:

# progress.py
import os
from collections import Counter

from prodigy.components.db import Database, connect
from prodigy.core import Arg, recipe
from prodigy.errors import RecipeError
from wasabi import msg


def stats(set_id: str, DB: Database) -> None:
    stats = {}
    DB.get_dataset_by_name(set_id)
    examples = DB.get_dataset_examples(set_id)
    meta = DB.get_meta(set_id)
    n_examples = len(examples)
    decisions = Counter()
    for eg in examples:
        if "answer" in eg:
            decisions[eg["answer"]] += 1
        elif "spans" in eg:
            for span in eg["spans"]:
                if "answer" in span:
                    decisions[span["answer"]] += 1
    assert isinstance(meta, dict)
    stats["dataset_stats"] = {
        "dataset": set_id,
        "created": meta.get("created"),
        "description": meta.get("description"),
        "author": meta.get("author"),
        "annotations": n_examples,
        "accept": decisions["accept"],
        "reject": decisions["reject"],
        "ignore": decisions["ignore"],
    }

    for key, values in stats.items():
        title = key.replace("_", " ").title()
        msg.divider(title, icon="emoji")
        if isinstance(values, list):
            msg.text(", ".join(values), spaced=True)
        else:
            msg.table(
                {
                    k.replace("_", " ").title().replace("Spacy", "spaCy"): v
                    for k, v in values.items()
                }
            )


@recipe(
    "stats.progress",
    dataset=Arg(help="Name of the dataset to report progress on."),
)
def progress(
    dataset: str,
):
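    # Check the variable before splitting: os.getenv returns None when it is unset
    allowed_sessions_env = os.getenv("PRODIGY_ALLOWED_SESSIONS")
    if not allowed_sessions_env:
        raise RecipeError(
            "Environment variable `PRODIGY_ALLOWED_SESSIONS` should be set"
        )
    allowed_sessions = {name.strip() for name in allowed_sessions_env.split(",")}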
    DB = connect()
    if dataset not in DB:
        raise RecipeError(f"Can't find '{dataset}' in database {DB.db_name}")
    session_datasets = DB.get_dataset_sessions(dataset)
    filtered_session_datasets = [
        dataset
        for dataset in session_datasets
        if dataset.split("-")[-1] in allowed_sessions
    ]
    for set_id in filtered_session_datasets:
        stats(set_id, DB)

This is piggybacking on the current stats command, but I think it does what you want. You can call it like this:

PRODIGY_ALLOWED_SESSIONS="bob,alex" python -m prodigy stats.progress {name_of_the_main_dataset} -F progress.py

That should display the stats for each annotator's dataset:

=============================== Dataset Stats ===============================

Dataset       sunglasses_brands-bob
Created       2024-05-03 11:39:54
Description   None
Author        None
Annotations   6
Accept        6
Reject        0
Ignore        0


=============================== Dataset Stats ===============================

Dataset       sunglasses_brands-alex
Created       2024-05-03 11:40:08
Description   None
Author        None
Annotations   6
Accept        6
Reject        0
Ignore        0
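
One caveat on the filter in the recipe: dataset.split("-")[-1] only looks at the text after the last hyphen, so an annotator name that itself contains a hyphen (say mary-ann) would never match. Here is a minimal sketch of a prefix-based alternative, assuming session datasets are named {dataset}-{session} as in the output above. filter_session_datasets is a hypothetical helper, not part of Prodigy:

```python
from typing import Iterable, List, Set


def filter_session_datasets(
    dataset: str, session_datasets: Iterable[str], allowed_sessions: Set[str]
) -> List[str]:
    """Keep session datasets named `{dataset}-{session}` with an allowed session."""
    prefix = f"{dataset}-"
    return [
        name
        for name in session_datasets
        # Strip the known prefix instead of splitting on "-", so that
        # session names containing hyphens are compared in full.
        if name.startswith(prefix) and name[len(prefix):] in allowed_sessions
    ]


names = ["sunglasses_brands-bob", "sunglasses_brands-mary-ann", "other-bob"]
print(filter_session_datasets("sunglasses_brands", names, {"bob", "mary-ann"}))
```

In the recipe this could stand in for the list comprehension that builds filtered_session_datasets, if your annotator names can contain hyphens.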

Thanks! It works nicely


Hi @magdaaniol,

I tried running the recipe from above on Prodigy version 1.15.4 and it doesn't seem to be working. Do you know why?

Thanks

Hi @ale,

Is there any error that you're getting? Or what exactly is the unexpected behavior? I just tried running it with 1.15.4 to double check and it worked as expected.

Hi @magdaaniol,

I think I found the reason. I exported a database on our server to a JSONL file, then imported it on my computer with pgy db-in. However, the import doesn't create the annotator sessions: I cannot see them with pgy stats database -ls on my computer, but the same command lists them on the server. If I run the progress.py recipe on the server, it displays the output. On my computer it doesn't display anything.

That would be it, yeah. db-in and db-out are atomic commands that are meant to be easy to use in a bash script if imports/exports of multiple datasets are required. There can be many custom filtering configurations when it comes to import/export so we decided it's best to keep it simple and allow the user to implement the filtering logic.

For example, if you store the names of all datasets to be exported in a datasets.txt file (one name per line), you could use the following bash script to write them all to disk in my_folder:

cat datasets.txt | while read line; do python -m prodigy db-out "$line" my_folder; done

Then to import them to another DB:

cat datasets.txt | while read line; do python -m prodigy db-in "$line" my_folder/"$line".jsonl ; done 
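
If you only need per-annotator counts on the machine where the session datasets are missing, another option is to compute them directly from the exported JSONL. This is a rough sketch, assuming each exported example carries the _session_id field that Prodigy adds for named sessions (worth checking in your own export). progress_by_session is a hypothetical helper, not a Prodigy function:

```python
import json
from collections import Counter, defaultdict
from typing import Dict, Iterable


def progress_by_session(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count answers per session from JSONL lines (one example per line)."""
    counts = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        eg = json.loads(line)
        # Assumption: examples annotated in a named session have "_session_id"
        session = eg.get("_session_id") or "unknown"
        counts[session][eg.get("answer", "unknown")] += 1
    return dict(counts)


# Usage with a file written by db-out (the path is illustrative):
# with open("my_folder/sunglasses_brands.jsonl", encoding="utf8") as f:
#     for session, decisions in progress_by_session(f).items():
#         print(session, dict(decisions))
```

Examples without a _session_id end up under "unknown", which makes it easy to spot exports where the assumption does not hold.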