Hi @ale,
The observed difference between these two overlap settings comes down to how work stealing is applied. With `feed_overlap`, work stealing is applied until there are no unsaved examples left in any of the sessions. That is why you could get to the end of the queue annotating with `joe` only. With `annotations_per_task`, work stealing is applied only until the estimated target for `joe` has been reached, which would be ca. 75 examples (given the 150 inputs). The remaining questions stay queued for `jane` until she gets to them.
The idea behind the `annotations_per_task` setting is to approximate this target as closely as possible given the available pool of annotators. If more than one annotation per task is configured, it is expected that there will be enough active annotators to reach that target. If you can't assume there will be enough active annotators to reach the target, it's better to use the `feed_overlap` setting (especially if you have 2 annotators and want one annotator to annotate whatever the other annotator didn't).
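For reference, these settings live in `prodigy.json` (or per-recipe overrides). A minimal sketch of the second setup, assuming the config keys as named in the docs:

```json
{
    "annotations_per_task": 1.2
}
```

The first setup would instead set `"feed_overlap": true`, with the sessions provided via `PRODIGY_ALLOWED_SESSIONS` as usual.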
Regarding your second experiment with 1.2 annotations per task:
The 1.2 condition is applied on each pull from the main queue. Every time `joe` asks for questions, the task router first tries to satisfy the whole-number part of the fraction, i.e. 1. Then, to handle the fractional part, i.e. 0.2, it computes a probability that determines where the task should be sent. Effectively, some tasks are sent to `joe`, some to `jane` and some to both.
If only `joe` is annotating, the probability of sending the task to `jane` would increase, given the 1.2 condition and the fact that she is expected to annotate as well. This is why `joe` is allowed to steal less and less as he progresses (and why you're observing numbers smaller than the batch size).
It's a probability-based mechanism because it's hard to know upfront how many tasks should be sent in total to one annotator or another. Since the input files can be huge or of undefined size, the Controller cannot know the total upfront, which is why it has to be estimated on a per-batch basis. In other words, the task router makes "local" decisions, trying to fulfill the conditions as best as possible given the current state.
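To make the fractional mechanism concrete, here's a minimal standalone sketch of the idea (not Prodigy's actual implementation): every task goes to the whole-number count of annotators, and the fractional part becomes the probability of adding one more.

```python
import random

def route_fractional(sessions, annotations_per_task, rng):
    """Pick which sessions receive a task, approximating a fractional
    annotations-per-task target. Illustrative sketch only."""
    whole, frac = divmod(annotations_per_task, 1)
    # Every task goes to at least `whole` annotators...
    chosen = rng.sample(sessions, k=int(whole))
    # ...and with probability `frac` it goes to one more.
    if frac > 0 and rng.random() < frac:
        extra = [s for s in sessions if s not in chosen]
        if extra:
            chosen.append(rng.choice(extra))
    return chosen

rng = random.Random(0)
sessions = ["joe", "jane"]
routes = [route_fractional(sessions, 1.2, rng) for _ in range(1000)]
avg = sum(len(r) for r in routes) / len(routes)
print(round(avg, 2))  # close to 1.2 on average
```

Over many tasks the average number of annotations per task converges to the configured value, even though any individual task gets either 1 or 2.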
This post explains the mechanism a bit more: How does `annotations_per_task : 2.5` work.
> Is there a way to ensure that the 30 examples that will have 2 annotations get such annotations from different annotators (in this case, `jane` and `joe`)?
The multiple annotations resulting from task router settings (`feed_overlap` or `annotations_per_task`) always come from different annotators.
As mentioned above, the fractional value of `annotations_per_task` in particular is the number of annotations per task that you can expect on average. The task router will try to meet this target as best as possible based on the progress of the annotations. If all annotators defined in `ALLOWED_SESSIONS` are active and there is a large enough number of annotations (cf. the probability-based assignment), the final numbers will converge to the setting.
If you'd rather have a more deterministic router because you can afford to precompute the total (or even the queues), you can always implement a custom task router. See here for some examples of how it can be done: Task Routing · Prodigy · An annotation tool for AI, Machine Learning & NLP
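As a starting point, here's a sketch in the shape the task-routing docs describe: a callable taking `(ctrl, session_id, item)` and returning the list of session names that should annotate the item. The annotator names are the hypothetical ones from this thread, and the routing logic is just an illustration.

```python
import zlib

# Hypothetical full session names, matching the examples in this thread.
ANNOTATORS = ["jane", "joe"]

def task_router_double(ctrl, session_id, item):
    """Send every task to both annotators: exactly 2 annotations per
    task, always from different people."""
    return list(ANNOTATORS)

def task_router_partial(ctrl, session_id, item):
    """A deterministic take on ~1.2 annotations per task: a stable hash
    of the input text sends every 5th task to both annotators, the rest
    only to the session that asked for it."""
    if zlib.crc32(item["text"].encode("utf8")) % 5 == 0:
        return list(ANNOTATORS)
    return [session_id]
```

In a recipe you'd return such a function under the `"task_router"` key of the components dictionary (per the task routing docs linked above); the `ctrl` argument is the Controller, which this sketch doesn't need to use.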