I'm trying to gain a better understanding of how task routing works, especially work stealing. Under what conditions can a task be "stolen" from one session and assigned to another?
Additionally, the documentation states that work stealing will ensure that each task is annotated "at least once", but will it ensure that each task is annotated at least n times if annotations_per_task is set?
Under what conditions can a task be "stolen" from one session and assigned to another?
Three conditions must all hold for work stealing to occur:
allow_work_stealing is set to True (this is the default, but can be set to False via prodigy.json)
A session has exhausted its own available tasks (as determined by the selected task router)
There are other sessions available with unanswered tasks
In practical terms, work stealing typically happens towards the end of the annotation stream when some sessions become idle. Tasks are stolen from the least active sessions first.
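To make the three conditions concrete, here is a minimal toy model in plain Python. It is not Prodigy's implementation: the `Session` class, the `steal_task` function, and the use of an `answered` count as a proxy for "least active" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for an annotator session (not a Prodigy class)."""
    name: str
    queue: list = field(default_factory=list)  # tasks routed here, not yet answered
    answered: int = 0  # assumed proxy for how active the session has been


def steal_task(requester, sessions, allow_work_stealing=True):
    """Return (donor_name, task) if all three conditions hold, else None."""
    # Condition 1: work stealing must be enabled.
    if not allow_work_stealing:
        return None
    # Condition 2: the requesting session has no tasks of its own left.
    if requester.queue:
        return None
    # Condition 3: some other session still holds unanswered tasks.
    donors = [s for s in sessions if s is not requester and s.queue]
    if not donors:
        return None
    # Assumption: "least active" is approximated by fewest answers so far.
    donor = min(donors, key=lambda s: s.answered)
    return donor.name, donor.queue.pop(0)


alice = Session("alice", queue=[], answered=10)
bob = Session("bob", queue=["t7", "t8"], answered=1)
carol = Session("carol", queue=["t9"], answered=5)
sessions = [alice, bob, carol]

stolen = steal_task(alice, sessions)  # alice is idle, so she takes from bob
blocked = steal_task(alice, sessions, allow_work_stealing=False)  # disabled
```

In this sketch alice has emptied her queue, so she receives a task from bob, the session with the fewest answers. With stealing disabled, the same call returns nothing.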
Will work stealing ensure that each task is annotated at least n times if annotations_per_task is set?
The annotations_per_task setting and work stealing operate independently:
The system first tries to fulfill the annotations_per_task requirement using the normal task distribution.
Only if this yields no more tasks to be annotated does work stealing come into play, ensuring each task is annotated at least once.
Generally, the annotations_per_task setting should be respected even with work stealing enabled.
However, edge cases can occur. For example, if you have 3 annotators, request 2 annotations per example, and one annotator opens a session but never returns, most or all annotations may come from the remaining 2 annotators.
Remember, work stealing can be disabled by setting allow_work_stealing to False in prodigy.json if needed.
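As a small illustration, the snippet below prints the JSON fragment for the two settings discussed in this thread. The values are only examples, and in practice the keys would be merged into an existing prodigy.json rather than replacing it.

```python
import json

# Settings named in this thread; the values are only an example.
settings = {
    "allow_work_stealing": False,  # turn work stealing off
    "annotations_per_task": 2,     # request two annotations per task
}

# Render the fragment as it would appear inside prodigy.json.
config_text = json.dumps(settings, indent=2)
print(config_text)
```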
Thanks, @magdaaniol, for the detailed explanation! We want work stealing for our task, so these details will let us confirm we're seeing the correct behavior in testing.
It looks like our initial confusion was tied to setting annotations_per_task: 2 without also setting PRODIGY_ALLOWED_SESSIONS. This resulted in the tasks in the initial batch assignment only being annotated once (presumably because there was only one known session).
This resulted in the tasks in the initial batch assignment only being annotated once (presumably because there was only one known session).
That's right. This is a limitation of the current routing mechanism: each task is routed only once, based on the list of sessions known at that moment.
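The effect can be sketched with a toy router. This is an illustrative model only, not Prodigy's routing code: the `route_once` function and its round-robin selection are assumptions, and the session names are made up. The only detail taken from Prodigy is that PRODIGY_ALLOWED_SESSIONS holds a comma-separated list of session names.

```python
def route_once(task_id, known_sessions, annotations_per_task):
    """Toy router: pick target sessions a single time, from sessions known now."""
    # A task cannot be assigned to more sessions than currently exist.
    n = min(annotations_per_task, len(known_sessions))
    # Rotate the starting point by task_id so tasks spread across sessions.
    start = task_id % len(known_sessions) if known_sessions else 0
    return [known_sessions[(start + i) % len(known_sessions)] for i in range(n)]


# Only one session has connected when the first batch is routed.
first_batch = {t: route_once(t, ["alice"], 2) for t in range(3)}

# All sessions are declared up front, as a comma-separated list
# like the value of PRODIGY_ALLOWED_SESSIONS.
declared = "alice,bob,carol".split(",")
later_batch = {t: route_once(t, declared, 2) for t in range(3)}
```

With a single known session, every task in `first_batch` is assigned to just one annotator even though two annotations were requested. Once all three sessions are known in advance, each task in `later_batch` is assigned to two of them.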