Prodigy Annotation Task Allocation Issue with Multi-Session Setup

Hi @miguelclaramunt !

Thanks for the detailed report — the behavior you’re seeing is actually consistent with how Prodigy currently makes routing decisions in multi-annotator setups.

  1. When you set PRODIGY_ALLOWED_SESSIONS, the router (route_average_per_task) pre-assigns tasks to the session names in that list. These assignments are essentially reserved in the main stream. A task only becomes "open" (and thus stealable) when an annotator with that session ID connects and is served the task. The task is then moved to that session's _open_tasks list, as you can see in the get_questions method of the Session class (prodigy/components/session.py).

  2. The steal_work function is designed to take tasks from other sessions that are idle. It does this exclusively by iterating through the _open_tasks of other active Session objects.

  3. If an annotator in your PRODIGY_ALLOWED_SESSIONS list never connects, the tasks assigned to them remain in the main stream but never enter anyone's _open_tasks list. Consequently, they cannot be stolen. This is precisely the behavior you're observing (see the simplified sketch after this list).
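
To make the mechanics a bit more concrete, here's a heavily simplified, self-contained sketch of the idea. It is not Prodigy's code; the names only mirror the ones mentioned above (Session, _open_tasks, steal_work) to show why reservations for never-connecting sessions stay locked:

# Simplified illustration only; NOT Prodigy's actual implementation.

class Session:
    def __init__(self, name):
        self.name = name
        self._open_tasks = []  # filled only when this session requests work

    def get_questions(self, reserved):
        # Reserved tasks move from the main stream into _open_tasks
        # only once the session connects and asks for a batch.
        self._open_tasks.extend(reserved.pop(self.name, []))
        return list(self._open_tasks)


def steal_work(requester, active_sessions):
    # Stealing only iterates the _open_tasks of existing Session objects;
    # reservations for sessions that never connected are invisible here.
    stolen = []
    for other in active_sessions:
        if other is not requester and other._open_tasks:
            stolen.append(other._open_tasks.pop(0))
    return stolen


# Per-session reservations made up front by the router:
reserved = {"anna": ["t1", "t2"], "marina": ["t3", "t4"]}
active_sessions = []

anna = Session("anna")            # anna connects...
active_sessions.append(anna)
anna.get_questions(reserved)      # ...so her tasks become open (stealable)

miguel = Session("miguel")        # miguel connects with no reservation
active_sessions.append(miguel)
print(steal_work(miguel, active_sessions))  # ['t1'] stolen from anna
print(reserved)  # {'marina': ['t3', 't4']} stays locked until marina connects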

This was discussed in more detail in an older thread. A couple of relevant points from there:

  • With PRODIGY_ALLOWED_SESSIONS, the router can plan more precisely, but it also assumes that all declared sessions will eventually connect. If some stay inactive, their share of tasks can remain locked.
  • Most importantly, it’s not obvious when a session should be considered “deprecated,” so the system errs on the side of keeping its reservations.
  • Work stealing only redistributes when a session “wakes up,” not proactively in the background. This is perhaps something we could make configurable in future versions.

As suggested in the thread I mentioned, you could work around the absent-but-registered sessions by running a small session-initialization script that calls /get_session_questions once for each session right after launching the Prodigy server. By simulating a connection from each allowed annotator, you create Session objects for them, and by having each of them request one batch of tasks, you populate their _open_tasks, making those tasks available for stealing.
Here's an example of such a script:

import requests

PRODIGY_URL = "http://localhost:8080"

# Must match dataset-session_name pattern
SESSIONS = ["test-anna", "test-marina", "test-rosa", "test-miguel", "test-francesca"]

def initialize_sessions():
    """
    Call /get_session_questions once for each allowed session to
    populate their queues and trigger the router.
    """
    for session in SESSIONS:
        try:
            r = requests.post(
                f"{PRODIGY_URL}/get_session_questions",
                json={"session_id": session},
                timeout=10,
            )
            if r.status_code == 200:
                tasks = r.json().get("tasks", [])
                print(f"Session {session}: {len(tasks)} tasks initialized")
            else:
                print(f"Session {session} failed: {r.status_code} {r.text}")
        except requests.RequestException as e:
            print(f"Session {session} error: {e}")

if __name__ == "__main__":
    initialize_sessions()

Please note that in the POST request you should use the dataset-session_name pattern, e.g. my_dataset-anna, as this is the session ID that Prodigy creates during Controller initialization.
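
If it helps, you can also build that list programmatically; the dataset and annotator names below are just placeholders matching the example above:

DATASET = "test"  # the dataset name passed to the recipe
ANNOTATORS = ["anna", "marina", "rosa", "miguel", "francesca"]
SESSIONS = [f"{DATASET}-{name}" for name in ANNOTATORS]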
