Hi @tuomo_h,
I realize that in this case it's harder to anticipate all the outcomes without access to the underlying implementation details.
"1.2 annotations per item" means that, on average, each task will receive 1.2 annotations. This is achieved by having some tasks annotated by 1 person and others by 2 people.
To break down how this works:
- Every task will have at least 1 annotator (the integer part of 1.2).
- Some tasks will get an additional annotator, determined probabilistically.
- The percentage of tasks that get 2 annotators is determined by the decimal part (0.2 or 20%).
The router implements this logic in two phases:
- First, it assigns the whole number of annotators (1 in this case) to every task
- Then, it applies a probabilistic method to assign a second annotator to some tasks
When `average=1.2`:
- 80% of tasks will be annotated by 1 person
- 20% of tasks will be annotated by 2 people (as you said)
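For example, over 100 tasks that works out to 80 × 1 + 20 × 2 = 120 annotations in total, i.e. an average of 120 / 100 = 1.2 annotations per task.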
The code specifically handles this in these lines:
```python
# In case of average=1.5 we need to do something probalistic.
if len(annot) < annots_needed:
    if len(pool) == 0:
        log_router(hash_attr, item_hash, annot)
        return annot
    prob_from_hash = h / 1000 % 1
    prob_required = annots_needed % 1
    if prob_from_hash < prob_required:
        idx = h % len(pool)
        annot.append(pool.pop(idx))
```
Where:

- `prob_required = annots_needed % 1` gives the decimal part (0.2 for 1.2)
- `prob_from_hash` generates a pseudo-random number between 0 and 1
- If this pseudo-random number is less than 0.2, the task gets a second annotator
This approach ensures that across a sufficiently large number of tasks, the average number of annotations per task will converge to 1.2, while distributing the double annotations randomly but deterministically (based on the task hash).
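If it helps, here's a small standalone sketch (not Prodigy code, and the hash values are synthetic) that applies the same rule to many hashes and shows the average converging to 1.2:

```python
import random

# Simulate the hash-based rule above for many synthetic task hashes and check
# that the average number of annotators per task approaches 1.2.
def n_annotators(h, annots_needed=1.2):
    prob_from_hash = h / 1000 % 1      # pseudo-random value in [0, 1) derived from the hash
    prob_required = annots_needed % 1  # 0.2 for an average of 1.2
    return int(annots_needed) + (1 if prob_from_hash < prob_required else 0)

random.seed(0)
hashes = [random.randint(0, 10**9) for _ in range(100_000)]
counts = [n_annotators(h) for h in hashes]
print(sum(counts) / len(counts))  # ≈ 1.2
```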
As you can probably see already, there's one important implication here: this mechanism is designed for fractional assignments between N and N+1 annotators (where N is the integer part of the average). It does not support jumping directly from 1 to 3 annotators for a specific percentage of tasks; it only handles incremental increases (1→2→3).
Just to give a bit more context: this is a good solution for scenarios where the final pool of annotators is unknown upfront or can change throughout the project. Also, Prodigy handles data as streams, which allows memory-efficient handling of large datasets but means the total number of examples is usually not known upfront. The router makes decisions task by task, trying to balance the distribution based on the current context, which is the most sensible approach for this scenario.
To fulfill your requirement, you'd need a custom router that assigns each task to either 1 or 3 annotators:
```python
def _task_router(ctrl, session_id, item):
    single_pct = 0.7
    multi_count = 3
    hash_attr = "task" if ctrl.exclude_by == "task" else "input"
    item_hash = item.get(TASK_HASH_ATTR) if hash_attr == "task" else item.get(INPUT_HASH_ATTR)
    # Check existing annotation count
    hash_count = ctrl.db.get_hash_count(ctrl.dataset, hash=item_hash, kind=ctrl.exclude_by)
    # Determine target annotators
    is_single_annotator = (item_hash % 100) < (single_pct * 100)
    target_annotators = 1 if is_single_annotator else multi_count
    # Early exit conditions
    if hash_count >= target_annotators:
        log_router(hash_attr, item_hash, [], "Already has enough annotations")
        return []  # Already has enough annotations
    if hash_count > 0 and hash_count < multi_count and not is_single_annotator:
        # For multi-annotator tasks, only assign if we can assign ALL remaining annotators at once
        needed = target_annotators - hash_count
        if len(ctrl.session_ids) < needed:
            log_router(hash_attr, item_hash, [], "Not enough annotators available")
            return []  # Not enough annotators available to complete this properly
    # Get available annotators pool
    pool = ctrl.session_ids.copy()  # Create a copy to avoid modifying the original
    if hash_count > 0:
        # Get annotators who already did this task
        annot_examples = ctrl.db.get_dataset_examples_by_hash(
            ctrl.dataset, hash=item_hash, kind=ctrl.exclude_by
        )
        already_annotated = [ex["_session_id"] for ex in annot_examples]
        pool = [u for u in pool if u not in already_annotated]
    # Assign annotators
    annot = []
    needed = target_annotators - hash_count
    # Only proceed if we can assign all needed annotators at once
    if len(pool) >= needed:
        while len(annot) < needed and pool:
            idx = item_hash % len(pool)
            annot.append(pool.pop(idx))
    log_router(hash_attr, item_hash, annot, "Normal routing")
    return annot
```
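To wire this into your workflow, you'd return the router from a custom recipe via the `task_router` component. Here's a rough sketch of what that could look like (the recipe name, loader call and `view_id` are placeholders, so adjust them to your actual recipe and data format):

```python
import prodigy
from prodigy.components.stream import get_stream

@prodigy.recipe("textcat-custom-routing")
def textcat_custom_routing(dataset: str, source: str):
    stream = get_stream(source)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",
        "task_router": _task_router,  # the custom router defined above
    }
```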
This approach, however, would only work under the assumption that there are always 3 annotators available (`while len(annot) < needed and pool`). If not enough annotators are available, the question won't be routed, because the router refuses to send it to only 1 or 2 annotators when 3 are required. You might want to modify this router to fall back to assigning just 1 annotator rather than discarding the example entirely, but this would introduce uncertainty in your distribution depending on annotator availability. These availability issues can be addressed by setting the `PRODIGY_KNOWN_SESSIONS` variable (so that the controller is aware at all times of all available annotators) and setting `work_stealing` to false, so that tasks queued for slower annotators aren't taken over by faster ones. Depending on how many examples you have, reducing the batch size can also help improve the precision of the final split.
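If you do want that fallback, here's a minimal sketch of how the final assignment block of the router above could be adapted (untested, so treat it as a starting point rather than a drop-in solution):

```python
# Inside _task_router, replacing the final assignment block:
annot = []
needed = target_annotators - hash_count
if len(pool) >= needed:
    while len(annot) < needed and pool:
        idx = item_hash % len(pool)
        annot.append(pool.pop(idx))
elif pool and hash_count == 0:
    # Not enough annotators for the full multi-annotator target: fall back to a
    # single annotator instead of skipping the example entirely. Restricting this
    # to unseen tasks (hash_count == 0) keeps partially annotated multi-annotator
    # tasks out of the fallback.
    annot.append(pool.pop(item_hash % len(pool)))
log_router(hash_attr, item_hash, annot, "Normal routing")
return annot
```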
Still, given the probabilistic nature of the assignment, the exact 30/70 split cannot be guaranteed; it should converge to it with a large enough sample of examples.
If you need precise control over exactly which items get how many annotations, consider pre-allocating your annotation assignments by:
- Processing your dataset upfront
- Adding annotation assignment information to each task's metadata (e.g., `"meta": {"target_annotator": "jane"}`)
- Creating a custom router that reads this metadata and assigns annotators accordingly
This approach gives you exact control over the distribution and over which specific items receive multiple annotations, which can be valuable for research purposes where you need exact splits for IAA calculation or want specific, more difficult items to have higher coverage.
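For illustration, here's a minimal sketch of such a metadata-driven router. It assumes a hypothetical `"annotators"` list in each task's `"meta"` (a plural variant of the example above) containing the names the annotators use in their `?session=` URLs; depending on your setup, you may need to adjust how those names are matched against session IDs:

```python
def _meta_task_router(ctrl, session_id, item):
    hash_attr = "task" if ctrl.exclude_by == "task" else "input"
    item_hash = item.get(TASK_HASH_ATTR) if hash_attr == "task" else item.get(INPUT_HASH_ATTR)
    # Annotator names pre-assigned to this task during the upfront processing step
    targets = item.get("meta", {}).get("annotators", [])
    # Route to every known session whose name matches a pre-assigned annotator
    # (full session IDs usually end with the ?session= name, e.g. "my_dataset-jane")
    annot = [s for s in ctrl.session_ids if any(s.endswith(f"-{name}") for name in targets)]
    log_router(hash_attr, item_hash, annot)
    return annot
```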