Inconsistency in the Number of Annotated Data

Hi all,

I am doing NER and text classification annotation using a custom recipe for multiple annotators with named sessions. The annotation process runs well, but in the end there is an inconsistency in the number of annotated examples. My dataset has 740 examples. For one annotator, the "No tasks available" message appears after 727 annotations; for the other annotator, it appears after 734 annotations.

How can I make sure that both annotators annotate all 740 examples? And how can I check, using a Python script, which examples have not yet been annotated by at least one annotator?

Hi @sigitpurnomo ,

Do you have any settings related to task overlap between the annotators, such as feed_overlap, annotations_per_task or allow_work_stealing? Do you define your named sessions up front with PRODIGY_ALLOWED_SESSIONS?

In principle, to make sure all annotators annotate all examples, you should set feed_overlap: true in your prodigy.json. That's it.
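For example, the relevant overlap settings in prodigy.json could look like this (a minimal sketch, not a complete config; the annotations_per_task and allow_work_stealing values shown here are just the defaults for full overlap):

{
  "feed_overlap": true,
  "annotations_per_task": null,
  "allow_work_stealing": false
}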

Now, in order to check whether there are any examples with fewer annotations than expected in the DB, you can use the following Python script:

from prodigy.components.db import connect

def find_missing_annotations(dataset_name):
    db = connect()
    annotations = db.get_dataset_examples(dataset_name)

    # Group by task hash and count annotations per task
    task_counts = {}
    for ann in annotations:
        task_hash = ann["_task_hash"]
        task_counts[task_hash] = task_counts.get(task_hash, 0) + 1

    # Find tasks with fewer than 2 annotations
    incomplete = {task_hash: count for task_hash, count in task_counts.items() if count < 2}

    # Get the annotated examples for these hashes
    missing_tasks = [ann for ann in annotations if ann["_task_hash"] in incomplete]

    print(f"Found {len(incomplete)} tasks with incomplete annotation coverage")
    return missing_tasks

# Usage
missing = find_missing_annotations("datasetname")

Or, if it's easier, you can get it directly on the CLI with the jq tool:

python -m prodigy db-out datasetname | jq -s 'group_by(._task_hash) | map(select(length < 2)) | length'
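If you also want to see which examples those are rather than just the count, a small variant of the same pipeline should work (assuming your tasks have a text field; adjust the key to whatever your tasks actually use):

python -m prodigy db-out datasetname | jq -s 'group_by(._task_hash) | map(select(length < 2) | .[0].text)'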

And this gives the number of annotations per annotator:

prodigy db-out datasetname | jq -s 'group_by(._annotator_id) | map({annotator: .[0]._annotator_id, count: length})'

Hi @magdaaniol,

Thank you for your response.

I have set feed_overlap to true in my prodigy.json file. The annotations_per_task is null, and allow_work_stealing is false. I also defined the named sessions using PRODIGY_ALLOWED_SESSIONS.

I have tried your script, and the result is "Found 7 tasks with incomplete annotation coverage". I assume this result comes from the second annotator, who annotated 734 examples with 1 example ignored. But what about the first annotator, who annotated only 727 examples? How can the incomplete tasks be displayed again in the interface so they can be annotated?
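For reference, this is the quick check I put together on top of your script to see how many tasks each annotator is still missing (a rough sketch; it assumes each example carries an _annotator_id, falling back to _session_id, and it can only see tasks that at least one annotator has saved):

from prodigy.components.db import connect

db = connect()
annotations = db.get_dataset_examples("datasetname")

# Task hashes saved by each annotator
seen = {}
for ann in annotations:
    annotator = ann.get("_annotator_id") or ann.get("_session_id")
    seen.setdefault(annotator, set()).add(ann["_task_hash"])

# Every task hash that appears in the dataset at all
all_hashes = {ann["_task_hash"] for ann in annotations}

# A task counts as "missing" for an annotator if someone else saved it but they did not
for annotator, hashes in sorted(seen.items()):
    missing = all_hashes - hashes
    print(f"{annotator}: {len(hashes)} annotated, {len(missing)} missing")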