Overlap between annotators for calculating inter-annotator agreement

Hey!

I've read the documentation on having multiple annotators annotate the same item, but I still need clarification on a few issues.

First of all, I find it very difficult to wrap my head around annotation overlap at the level of a single item, e.g. having 1.2 annotations per item. What does this actually mean? That 20% of the data has been annotated by two annotators?

We're using Prodigy for academic research, so we would ideally like to have more control over the overlapping tasks for measuring agreement between annotators and modelling their competence. Is it possible to set up the overlap so that 30% of the items will be annotated by three different annotators whereas the remaining 70% are annotated by a single annotator?


Hi @tuomo_h,

I realize that in this case it's hard to anticipate all the outcomes without access to the underlying implementation details, so let me walk through it.
"1.2 annotations per item" means that, on average, each task will receive 1.2 annotations. This is achieved by having some tasks annotated by 1 person and others by 2 people.

To break down how this works:

  1. Every task will have at least 1 annotator (the integer part of 1.2).
  2. Some tasks will get an additional annotator, determined probabilistically.
  3. The percentage of tasks that get 2 annotators is determined by the decimal part (0.2 or 20%).

The router implements this logic in two phases:

  • First, it assigns the whole number of annotators (1 in this case) to every task
  • Then, it applies a probabilistic method to assign a second annotator to some tasks

When average=1.2:

  • 80% of tasks will be annotated by 1 person
  • 20% of tasks will be annotated by 2 people (as you said)
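
For example, with 1,000 tasks that works out to roughly 800 × 1 + 200 × 2 = 1,200 annotations in total, i.e. 1.2 per task on average.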

The code specifically handles this in these lines:

# In case of average=1.5 we need to do something probabilistic.
if len(annot) < annots_needed:
    if len(pool) == 0:
        log_router(hash_attr, item_hash, annot)
        return annot
    prob_from_hash = h / 1000 % 1
    prob_required = annots_needed % 1
    if prob_from_hash < prob_required:
        idx = h % len(pool)
        annot.append(pool.pop(idx))

Where:

  • prob_required = annots_needed % 1 gives the decimal part (0.2 for 1.2)
  • prob_from_hash generates a pseudo-random number between 0 and 1
  • If this random number is less than 0.2, the task gets a second annotator

This approach ensures that across a sufficiently large number of tasks, the average number of annotations per task will converge to 1.2, while distributing the double annotations randomly but deterministically (based on the task hash).
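
To make that convergence concrete, here's a small standalone simulation (plain Python, not Prodigy code) that applies the same rule to random stand-in hashes:

import random

average = 1.2
base = int(average)            # every task gets at least this many annotators
prob_required = average % 1    # chance of one extra annotator (0.2)

random.seed(0)
n_tasks = 100_000
total = 0
for _ in range(n_tasks):
    h = random.getrandbits(32)        # stands in for the task hash
    prob_from_hash = h / 1000 % 1     # same trick as in the built-in router
    total += base + (1 if prob_from_hash < prob_required else 0)

print(total / n_tasks)                # ≈ 1.2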

As you can probably see already, there's one important implication here: the built-in router is designed for fractional assignments between N and N+1 annotators (where N is the integer part of the average). The current implementation only supports these incremental increases (1→2, 2→3); it cannot jump directly from 1 to 3 annotators for a specific percentage of tasks.
Just to give a bit more context: this is a good solution for scenarios where the final pool of annotators is unknown upfront or can change throughout the project. Also, Prodigy handles data as streams, which allows for memory-efficient handling of large datasets, but it also means the total number of examples is usually not known upfront. The router therefore makes decisions task by task, balancing the distribution based on the current context, which is the most sensible approach in that setting.

To fulfill your requirement, you'd need a custom router that assigns each item either to one annotator or to three annotators:

from prodigy.util import INPUT_HASH_ATTR, TASK_HASH_ATTR  # Prodigy's hash attribute constants
# log_router is the same logging helper used in the built-in router shown above

def _task_router(ctrl, session_id, item):
    single_pct = 0.7  # share of items routed to a single annotator
    multi_count = 3   # number of annotators for the remaining items
    hash_attr = "task" if ctrl.exclude_by == "task" else "input"
    item_hash = item.get(TASK_HASH_ATTR) if hash_attr == "task" else item.get(INPUT_HASH_ATTR)
    
    # Check existing annotation count
    hash_count = ctrl.db.get_hash_count(ctrl.dataset, hash=item_hash, kind=ctrl.exclude_by)
    
    # Determine target annotators
    is_single_annotator = (item_hash % 100) < (single_pct * 100)
    target_annotators = 1 if is_single_annotator else multi_count
    
    # Early exit conditions
    if hash_count >= target_annotators:
        log_router(hash_attr, item_hash, [], "Already has enough annotations")
        return []  # Already has enough annotations
    
    if hash_count > 0 and hash_count < multi_count and not is_single_annotator:
        # For multi-annotator tasks, only assign if we can assign ALL remaining annotators at once
        needed = target_annotators - hash_count
        if len(ctrl.session_ids) < needed:
            log_router(hash_attr, item_hash, [], "Not enough annotators available")
            return []  # Not enough annotators available to complete this properly
    
    # Get available annotators pool
    pool = ctrl.session_ids.copy()  # Create a copy to avoid modifying the original
    
    if hash_count > 0:
        # Get annotators who already did this task
        annot_examples = ctrl.db.get_dataset_examples_by_hash(
            ctrl.dataset, hash=item_hash, kind=ctrl.exclude_by
        )
        already_annotated = [ex["_session_id"] for ex in annot_examples]
        pool = [u for u in pool if u not in already_annotated]
    
    # Assign annotators
    annot = []
    needed = target_annotators - hash_count
    
    # Only proceed if we can assign all needed annotators at once
    if len(pool) >= needed:
        while len(annot) < needed and pool:
            idx = item_hash % len(pool)
            annot.append(pool.pop(idx))
    log_router(hash_attr, item_hash, annot, "Normal routing")
    return annot
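
For completeness, here's a rough sketch of how such a router could be plugged into a custom recipe. The recipe name and view_id are placeholders for illustration, and the "allow_work_stealing" override is my assumption for the setting that disables work stealing in recent Prodigy versions (see the note below):

import prodigy
from prodigy.components.stream import get_stream

@prodigy.recipe("textcat.split-router")
def split_router_recipe(dataset: str, source: str):
    # Load the source as a Prodigy stream with hashes set
    stream = get_stream(source, rehash=True, dedup=True)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",   # placeholder UI for this sketch
        "task_router": _task_router,   # the custom router defined above
        "config": {"allow_work_stealing": False},
    }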

This approach, however, only works under the assumption that there are always 3 annotators available (while len(annot) < needed and pool). If not enough annotators are available, the question won't be routed at all, because the router refuses to send it to only 1 or 2 annotators when 3 are required. You could modify the router to fall back to assigning just 1 annotator rather than discarding the example entirely, but that would introduce uncertainty in your distribution depending on annotator availability. These availability issues can be mitigated by setting the PRODIGY_KNOWN_SESSIONS variable (so that the controller is aware of all expected annotators at all times) and by disabling work stealing, so that faster annotators don't take over tasks already routed to slower ones. Depending on how many examples you have, reducing the batch size can also help improve the precision of the final split.

Still, given the probabilistic nature of the assignment, the exact 30/70 split cannot be guaranteed: it should converge to that ratio with a large enough sample of examples, but individual runs will deviate somewhat.

If you need precise control over exactly which items get how many annotations, consider pre-allocating your annotation assignments by:

  1. Processing your dataset upfront
  2. Adding annotation assignment information to each task's metadata (e.g., "meta": {"target_annotator": "jane"})
  3. Creating a custom router that reads this metadata and assigns annotators accordingly

This approach gives you exact control over the distribution and over which specific items receive multiple annotations, which can be valuable for research purposes, whether you need exact splits for IAA calculation or want specific tasks or more difficult items to have higher coverage.
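
As a minimal sketch, assuming each task carries a hypothetical "target_annotators" list in its meta (analogous to the "target_annotator" example above), such a router could look like this:

def metadata_router(ctrl, session_id, item):
    # Pre-allocated during preprocessing, e.g.
    # "meta": {"target_annotators": ["project-jane", "project-tom", "project-ana"]}
    targets = item.get("meta", {}).get("target_annotators", [])
    # Only route to sessions the controller currently knows about;
    # the names must match the session IDs Prodigy uses
    return [s for s in targets if s in ctrl.session_ids]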


Wow @magdaaniol, thanks for a super informative reply – much appreciated! This will help us a lot!
