Correct Audio Transcription

Thank you, this was very helpful.

One problem I had with this was that when I tried to add multiple examples, Prodigy would only show one of them and drop the others as duplicates.

I'm calling set_hashes() on each example:

examples = [set_hashes(example, overwrite=True, input_keys=("audio"), task_keys=("audio")) for example in first_pass_transcripts]

I print out the task and input hashes for all examples:

DEBUG:root: example TASK hashes:
DEBUG:root:[2025448499, -677278922, 144250741, -581610436]
DEBUG:root: example INPUT hashes:
DEBUG:root:[819208844, 1154496076, -2057892666, 61420621]
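
For reference, set_hashes() stores the computed hashes on each example under the _input_hash and _task_hash keys, so the logging above is roughly:

import logging

# set_hashes() writes its results onto each example dict under the
# "_input_hash" and "_task_hash" keys, so we can just collect those
logging.debug(" example TASK hashes:")
logging.debug([eg["_task_hash"] for eg in examples])
logging.debug(" example INPUT hashes:")
logging.debug([eg["_input_hash"] for eg in examples])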

I then call the Python script that has the same prodigy.serve call with audio.transcribe.
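
Roughly, the call looks like this (a sketch; the dataset names and port are taken from the logs below):

import prodigy

# sketch: read the stream from the "asr_trial" dataset and save
# corrections to "asr_trial_corrected"; the "dataset:" source syntax
# tells the loader to read examples back out of the database
prodigy.serve("audio.transcribe asr_trial_corrected dataset:asr_trial --fetch-media", port=8080)

The log output: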

...
INFO:prodigy:DB: Creating dataset 'asr_trial'
...
INFO:prodigy:DB: Getting dataset 'asr_trial'
...
INFO:prodigy:DB: Added 4 examples to 1 datasets
...
INFO:prodigy:RECIPE: Calling recipe 'audio.transcribe'
INFO:prodigy:RECIPE: Starting recipe audio.transcribe
INFO:prodigy:LOADER: Loading stream from dataset asr_trial (answer: all)
...
INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
...
INFO:prodigy:LOADER: Rehashing stream
INFO:prodigy:VALIDATE: Validating components returned by recipe
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:VALIDATE: Creating validator for view ID 'blocks'
INFO:prodigy:VALIDATE: Validating Prodigy and recipe config
...
INFO:prodigy:DB: Creating dataset 'asr_trial_corrected'
...
INFO:prodigy:DB: Creating dataset '2021-04-16_17-23-10'
...
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:CONTROLLER: Validating the first batch for session: None
INFO:prodigy:PREPROCESS: Fetching media if necessary: ['audio', 'video']
INFO:prodigy:FILTER: Filtering duplicates from stream
INFO:prodigy:CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8080 ...

When I open up the UI, it shows the first example with the audio and transcription, but when I accept/reject/ignore it, it then shows No tasks available.

INFO:prodigy:POST: /get_session_questions
INFO:prodigy:FEED: Finding next batch of questions in stream
INFO:prodigy:FEED: skipped: -1607572746
INFO:prodigy:RESPONSE: /get_session_questions (0 examples)

@ines

  1. What might be the reason that it's skipping the other examples?
  2. Why is the skipped hash different from any of the hashes in my input?

I also tried doing it via the command line and ran into the same issue with duplicates. After writing all the examples to a .jsonl file, I tried to load it via the CLI:

prodigy audio.transcribe asr_trial_corrected ./data/transcriptions/prodigy_input_aws_transcribe.jsonl --fetch-media --loader jsonl

It still shows No tasks available after the first example.

17:55:11: FEED: Finding next batch of questions in stream
⚠ Warning: filtered 75% of entries because they were duplicates. Only 1 items
were shown out of 4. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.
17:55:11: RESPONSE: /get_session_questions (1 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
17:55:13: POST: /get_session_questions
17:55:13: FEED: Finding next batch of questions in stream
17:55:13: FEED: skipped: 21806991
17:55:13: RESPONSE: /get_session_questions (0 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK

It looks like the main problem here is that all examples except one are filtered out because they're considered duplicates, most likely because they ended up with the same hashes.

One potential problem is here:

The keys should be iterables of strings, but ("audio") ends up being "audio". So you'd either want this to be ("audio",) or ["audio"].
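
For example:

# a bare ("audio") is just a parenthesized string; only the trailing
# comma makes it a tuple, so iterating over it walks the characters
# "a", "u", "d", ... instead of the single key "audio"
type(("audio"))   # <class 'str'>
type(("audio",))  # <class 'tuple'>
type(["audio"])   # <class 'list'>

So the fixed call would look like this:

from prodigy import set_hashes

examples = [
    set_hashes(example, overwrite=True, input_keys=("audio",), task_keys=("audio",))
    for example in first_pass_transcripts
]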

EDIT: Solved. Details in EDIT section below.

Thanks for the prompt answer, but even with ("audio",) or ["audio"] I am having the same issue. I don't think set_hashes() mapping everything to the same hash is the problem, since after calling the function I log both the task and input hashes, and they all seem to be different:

DEBUG:root:Inputing example TASK hashes:
DEBUG:root:[-1464920486, 1203448608, -244439580, 1932216617]
DEBUG:root:Inputing example INPUT hashes:
DEBUG:root:[158202857, 1296278939, 318945175, 549324070]

What might be another reason for this? Should I also delete the "Session" datasets before rerunning the script? (I run prodigy drop on both the input and corrected datasets after each run of the script.)
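
Concretely, the cleanup between runs is (dataset names as in the logs above):

prodigy drop asr_trial
prodigy drop asr_trial_corrected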

One thing I noticed was that the script is rehashing the stream:

INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
INFO:prodigy:LOADER: Rehashing stream

Maybe this is causing the hashes to end up the same, since I have no control over the parameters of this rehashing?

EDIT: For each example in the stream, the text field was ''. When I passed the audio path as the text field, there were no duplicates and all 4 examples showed up in the UI. This supports the hypothesis that Prodigy is re-hashing the examples loaded from the dataset by their text fields. Is there any way to circumvent this?
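
Sketched, the workaround that worked for me, applied before the examples go into the input dataset:

for example in first_pass_transcripts:
    # the built-in rehash appears to key off the "text" field, so give
    # it something unique per example by mirroring the audio path into it
    example["text"] = example["audio"]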

Ah, thanks for the analysis! It looks like the recipe is indeed rehashing the stream :thinking: I can't think of a good reason why this is done in this particular recipe, so we should remove that. The easiest workaround is to just remove the rehash=True in recipes/audio.py (you can run prodigy stats to find the location of your Prodigy installation).
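
The line in question should look something like this (a sketch from memory, so the exact arguments may differ between versions):

from prodigy.components.loaders import get_stream

# before: the recipe rehashes the stream on load, overwriting your hashes
stream = get_stream(source, loader=loader, rehash=True, dedup=True)

# after: without rehash=True, the hashes you set yourself are kept
stream = get_stream(source, loader=loader, dedup=True)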

Edit: Fixed in v1.11!