Correct Audio Transcription

Thank you, this was very helpful.

One problem I had with this: when I tried to add multiple examples, Prodigy would only show one of them and drop the others as duplicates.

Calling set_hashes()

examples = [set_hashes(example, overwrite = True, input_keys = ("audio"), task_keys = ("audio")) for example in first_pass_transcripts]

I print out task and input hashes for all examples:

DEBUG:root: example TASK hashes:
DEBUG:root:[2025448499, -677278922, 144250741, -581610436]
DEBUG:root: example INPUT hashes:
DEBUG:root:[819208844, 1154496076, -2057892666, 61420621]

I then run the Python script, which makes the same prodigy.serve call with audio.transcribe:

INFO:prodigy:DB: Creating dataset 'asr_trial'
INFO:prodigy:DB: Getting dataset 'asr_trial'
INFO:prodigy:DB: Added 4 examples to 1 datasets
INFO:prodigy:RECIPE: Calling recipe 'audio.transcribe'
INFO:prodigy:RECIPE: Starting recipe audio.transcribe
INFO:prodigy:LOADER: Loading stream from dataset asr_trial (answer: all)
INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
INFO:prodigy:LOADER: Rehashing stream
INFO:prodigy:VALIDATE: Validating components returned by recipe
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:VALIDATE: Creating validator for view ID 'blocks'
INFO:prodigy:VALIDATE: Validating Prodigy and recipe config
INFO:prodigy:DB: Creating dataset 'asr_trial_corrected'
INFO:prodigy:DB: Creating dataset '2021-04-16_17-23-10'
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:CONTROLLER: Validating the first batch for session: None
INFO:prodigy:PREPROCESS: Fetching media if necessary: ['audio', 'video']
INFO:prodigy:FILTER: Filtering duplicates from stream
INFO:prodigy:CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8080 ...

When I open the UI, it shows the first example with the audio and transcription, but once I accept/reject/ignore it, the UI shows "No tasks available".

INFO:prodigy:POST: /get_session_questions
INFO:prodigy:FEED: Finding next batch of questions in stream
INFO:prodigy:FEED: skipped: -1607572746
INFO:prodigy:RESPONSE: /get_session_questions (0 examples)


  1. What might be the reason that it's skipping other examples?
  2. Why is the hash that is skipped different than any hash I have in my input?

I also tried it from the command line and hit the same issue with duplicates. After writing all examples to a .jsonl file, I tried to load it via the CLI:

prodigy audio.transcribe asr_trial_corrected ./data/transcriptions/prodigy_input_aws_transcribe.jsonl --fetch-media --loader jsonl

It still shows "No tasks available" after the first example.

17:55:11: FEED: Finding next batch of questions in stream
⚠ Warning: filtered 75% of entries because they were duplicates. Only 1 items
were shown out of 4. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.
17:55:11: RESPONSE: /get_session_questions (1 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
17:55:13: POST: /get_session_questions
17:55:13: FEED: Finding next batch of questions in stream
17:55:13: FEED: skipped: 21806991
17:55:13: RESPONSE: /get_session_questions (0 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK

It looks like the main problem here is that all examples except one are filtered out because they're considered duplicates, most likely because they ended up with the same hashes.

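One potential problem is in the set_hashes() call:

input_keys = ("audio"), task_keys = ("audio")

The keys should be iterables of strings, but ("audio") is just the string "audio", because parentheses without a trailing comma don't create a tuple. So you'd want this to be either ("audio",) or ["audio"].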
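
To make the tuple point concrete, here is a minimal plain-Python sketch (no Prodigy needed). The variable names are just for illustration:

```python
# Parentheses alone do not create a tuple; the trailing comma does.
keys_wrong = ("audio")    # this is simply the string "audio"
keys_tuple = ("audio",)   # one-element tuple
keys_list = ["audio"]     # one-element list

print(type(keys_wrong).__name__)  # str
print(type(keys_tuple).__name__)  # tuple

# Iterating over the plain string yields single characters,
# not the key name you intended.
print(list(keys_wrong))   # ['a', 'u', 'd', 'i', 'o']
print(list(keys_tuple))   # ['audio']
```

Anything that loops over the keys would therefore see "a", "u", "d", "i", "o" instead of "audio" when given the first form.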

EDIT: Solved. Details in EDIT section below.

Thanks for the prompt answer, but even with ("audio",) or ["audio"] I have the same issue. I don't think set_hashes() mapping examples to the same hash is the problem: after calling the function, I log both the task and input hashes, and they are all different:

DEBUG:root:Inputing example TASK hashes:
DEBUG:root:[-1464920486, 1203448608, -244439580, 1932216617]
DEBUG:root:Inputing example INPUT hashes:
DEBUG:root:[158202857, 1296278939, 318945175, 549324070]
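
Rather than comparing the logged values by eye, a quick check along these lines confirms there are no repeats. This is only a sketch: `duplicated` is a hypothetical helper, and the lists are copied from the log output above.

```python
from collections import Counter

# Hash values copied from the debug log above.
task_hashes = [-1464920486, 1203448608, -244439580, 1932216617]
input_hashes = [158202857, 1296278939, 318945175, 549324070]


def duplicated(values):
    """Return the values that occur more than once (hypothetical helper)."""
    return [value for value, count in Counter(values).items() if count > 1]


print(duplicated(task_hashes))   # []
print(duplicated(input_hashes))  # []
```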

What might be another reason for this? Should I also delete the "Session" datasets before rerunning the script (I run prodigy drop input_dataset and corrected_dataset after each run of the script)?

One thing I noticed was the script is rehashing the stream:

INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
INFO:prodigy:LOADER: Rehashing stream

Maybe this is what makes the hashes end up the same, since I have no control over the parameters of this hashing?

EDIT: For each example in the stream, the text field was ''. When I passed the audio path as the text field instead, there were no duplicates and all 4 examples showed up in the UI. This supports the hypothesis that Prodigy is rehashing the examples loaded from the dataset by their text fields. Is there any way to circumvent this?
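
The effect described in the EDIT can be illustrated with a toy sketch. It assumes, as hypothesized above, that the rehash only looks at the text field. `toy_hash` and the file names are made up for illustration and are not Prodigy's actual hashing code:

```python
import hashlib
import json


def toy_hash(example, keys):
    """Hash only the selected keys of an example.

    Hypothetical stand-in for a hashing step, NOT Prodigy's implementation.
    """
    picked = {key: example[key] for key in keys if key in example}
    payload = json.dumps(picked, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# Four distinct audio files (made-up names), all with an empty text field.
examples = [{"audio": f"clip_{i}.wav", "text": ""} for i in range(4)]

hashes_by_audio = {toy_hash(eg, ("audio",)) for eg in examples}
hashes_by_text = {toy_hash(eg, ("text",)) for eg in examples}

print(len(hashes_by_audio))  # 4 -> every example is unique
print(len(hashes_by_text))   # 1 -> all four collapse into one "duplicate"
```

Under that assumption, hashes set beforehand on the audio key would be overwritten by the rehash, and three of the four examples would be filtered out, which matches the "filtered 75%" warning.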

Ah, thanks for the analysis, it looks like the recipe is indeed rehashing the stream. I can't think of a good reason why this is done in this particular recipe, so we should remove that. The easiest workaround is to just remove the rehash=True in recipes/ (you can run prodigy stats to find the location of your Prodigy installation).

Edit: Fixed in v1.11!