Default router seems to route more than get_sessions receives

I'm using a custom recipe, but I haven't overridden the router or done anything strange with the controller. My stream yields 5 context dicts (I've added logging right above the yield to confirm this), but /get_session_questions only returns 2 examples.

15:38:58: POST: /get_session_questions
15:38:58: CONTROLLER: Getting batch of questions for session: None
15:40:01: ROUTER: Routing item with _input_hash=1765030649 -> ['2024-01-31_15-36-47']
15:40:25: ROUTER: Routing item with _input_hash=4956623 -> ['2024-01-31_15-36-47']
15:40:41: ROUTER: Routing item with _input_hash=573831279 -> ['2024-01-31_15-36-47']
15:41:31: ROUTER: Routing item with _input_hash=265572234 -> ['2024-01-31_15-36-47']
15:41:52: ROUTER: Routing item with _input_hash=884592512 -> ['2024-01-31_15-36-47']
15:41:52: RESPONSE: /get_session_questions (2 examples)

What could be going on?

IIRC the Controller object is distributed as a compiled shared object, so I can't read its source to help debug, sorry!

This is with Prodigy 1.14.12 or 1.14.14 and Python 3.10.

Anyone have an idea what is going on here?

I imagine it's some kind of timeout or default behavior that I'm unaware of.

Hi @peter-axion ,

Could it be that the tasks that don't make it to the session queue are already in the database in this session's dataset?
The function that enqueues questions from the stream via the router filters out tasks whose input hashes or task hashes (depending on your `exclude_by` config setting) are already in the dataset.
That's the most obvious thing that could be happening, so let's rule it out first. If it's not the case, I'll try to reproduce the problem.
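For illustration, the enqueue-time exclusion amounts to something like this minimal sketch in plain Python (not Prodigy's actual implementation; it assumes `exclude_by: "input"`, so filtering happens on `_input_hash`, and uses made-up hash values):

```python
def filter_seen(stream, seen_input_hashes):
    """Drop tasks whose _input_hash is already saved in the dataset."""
    for eg in stream:
        if eg["_input_hash"] not in seen_input_hashes:
            yield eg

# Hypothetical example: 5 tasks are routed, but 3 of them were already
# annotated in this session's dataset, so only 2 reach the queue.
stream = [{"_input_hash": h} for h in (1, 2, 3, 4, 5)]
seen = {2, 4, 5}  # input hashes already in the dataset
print(list(filter_seen(stream, seen)))
# → [{'_input_hash': 1}, {'_input_hash': 3}]
```

That would explain exactly the pattern in your log: the router routes all 5 items, but the response only contains the 2 that weren't filtered.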


Thank you for the reply! This is almost certainly what is happening.

I am deduplicating by the input hash, so I should be able to use `input_hash = prodigy.set_hashes({"text": text})["_input_hash"]` and do a DB lookup to pull that dedupe step forward in the process, before I spend time applying the model.
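In case it helps anyone else, here's a minimal sketch of that forward-pulled dedupe step. The filter itself is plain Python; the `seen_hashes` set and `hash_fn` would be supplied from Prodigy (the `db.get_input_hashes(...)` and `prodigy.set_hashes(...)` calls in the docstring are assumptions about the API, not tested here):

```python
def dedupe_by_input_hash(stream, seen_hashes, hash_fn):
    """Yield only tasks whose input hash isn't already annotated.

    seen_hashes: input hashes already in the dataset, e.g. (assumed API)
        set(db.get_input_hashes("my_dataset"))
    hash_fn: computes the input hash for a task, e.g. (assumed API)
        lambda eg: prodigy.set_hashes(eg)["_input_hash"]
    """
    for eg in stream:
        h = hash_fn(eg)
        if h not in seen_hashes:
            seen_hashes.add(h)  # also skip in-stream duplicates
            yield eg  # only now is it worth applying the model
```

Running the model only on tasks that survive this filter avoids wasting inference time on examples that would be excluded at enqueue time anyway.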

I confirmed it works!

I had assumed that the exclusion was applied within the get_stream() function, but it makes sense that it wasn't because I am not sure it can calculate an authoritative task_hash there.

Thanks for the help!

That's right, the task hashes need to be considered at the session level, since what should be excluded also depends on the router/feed overlap settings.
Glad to hear it's sorted out and thanks for reporting back!