Still shows No tasks available after the first example.
17:55:11: FEED: Finding next batch of questions in stream
⚠ Warning: filtered 75% of entries because they were duplicates. Only 1 items
were shown out of 4. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.
17:55:11: RESPONSE: /get_session_questions (1 examples)
INFO: ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
17:55:13: POST: /get_session_questions
17:55:13: FEED: Finding next batch of questions in stream
17:55:13: FEED: skipped: 21806991
17:55:13: RESPONSE: /get_session_questions (0 examples)
INFO: ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
It looks like the main problem here is that all examples except one are filtered out because they're considered duplicates, most likely because they ended up with the same hashes.
One potential problem is here:
The keys should be iterables of strings, but ("audio") ends up being "audio". So you'd either want this to be ("audio",) or ["audio"].
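For anyone who hits the same thing, here is a minimal plain-Python sketch of why the trailing comma matters. It does not call Prodigy at all; the variable names are only for illustration.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
not_a_tuple = ("audio")   # this is just the string "audio"
one_tuple = ("audio",)    # a tuple containing one string
as_list = ["audio"]       # a list containing one string

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Code that iterates over the "keys" sees single characters
# when it is handed the plain string instead of a tuple or list.
keys_from_string = list(not_a_tuple)
keys_from_tuple = list(one_tuple)
print(keys_from_string)
print(keys_from_tuple)
```

So any function that loops over the keys would look for fields named "a", "u", "d", "i" and "o" instead of a field named "audio".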
Thanks for the prompt answer, but even with ("audio",) or ["audio"] I am having the same issue. I don't think set_hashes() mapping examples to the same hash is the issue: after calling the function, I log both the task and input hashes, and they all seem to be different:
DEBUG:root:Inputing example TASK hashes:
DEBUG:root:[-1464920486, 1203448608, -244439580, 1932216617]
DEBUG:root:Inputing example INPUT hashes:
DEBUG:root:[158202857, 1296278939, 318945175, 549324070]
What might be another reason for this? Should I also delete the "Session" datasets before rerunning the script (I run prodigy drop input_dataset and corrected_dataset after each run of the script)?
One thing I noticed was the script is rehashing the stream:
Maybe this is causing the hashes to be the same, since I have no control over the parameters of this hashing?
EDIT: For each example in the stream, the text field was ''. When I passed the audio path as the text field, there were no duplicates and all 4 examples showed up in the UI. This supports the hypothesis that Prodigy is re-hashing the examples input to the dataset by their text fields. Is there any way to circumvent this?
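To illustrate the hypothesis, here is a rough sketch of how hashing on an empty text field would collapse four distinct audio examples into one. This is an assumption-laden toy: toy_hash and the clip_N.wav paths are hypothetical, and this is not Prodigy's actual hashing code.

```python
import hashlib
import json


def toy_hash(example, keys):
    """Hash only the listed keys of an example (illustrative, not Prodigy's algorithm)."""
    payload = json.dumps({k: example.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()


# Hypothetical stream: four different audio paths, each with an empty text field.
stream = [{"audio": f"clip_{i}.wav", "text": ""} for i in range(4)]

# Hashing on the empty text field gives every example the same hash,
# so a duplicate filter would keep only the first one.
hashes_by_text = {toy_hash(eg, ("text",)) for eg in stream}

# Hashing on the audio path keeps all four examples distinct.
hashes_by_audio = {toy_hash(eg, ("audio",)) for eg in stream}

print(len(hashes_by_text), len(hashes_by_audio))
```

With one unique hash out of four, three of the four examples would be dropped as duplicates, which matches the "filtered 75% of entries" warning in the log.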
Ah, thanks for the analysis. It looks like the recipe is indeed rehashing the stream. I can't think of a good reason why this is done in this particular recipe, so we should remove that. The easiest workaround is to just remove the rehash=True in recipes/audio.py (you can run prodigy stats to find the location of your Prodigy installation).