Correct Audio Transcription

Right, so I got that working now, but what I'm actually trying to do is load a transcription JSONL file that was not created with Prodigy, and I think that might be why the audio file isn't loading.
I looked at the structure of JSONL files created by Prodigy and created one like this:

transcript = [{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

The audio file is in the same directory as my python script and I did try and use its absolute path, which had no effect at all.

I'm pretty sure I got the structure and all key words right (haven't I?) and I'm guessing the problem might come from the input hash and task_hash. I just used hashes from a real prodigy output file and changed some digits, but I also use set_hashes, so I don't know if that causes some kind of error?

I do:

db = connect()
transcript = [set_hashes(eg) for eg in transcript]
db.add_dataset("asr")
db.add_examples(transcript, datasets=["asr"]) 
prodigy.serve("audio.transcribe annotations dataset:asr --fetch-media")

Thank you so much again for your quick and helpful answers, I appreciate it a lot!

Hi! When you say the audio isn't loading, do you mean that the examples show up but the audio widget / waveform is empty? Or do the examples not get sent out and you see "No tasks available"?

If the waveform is empty, try setting the PRODIGY_LOGGING=basic environment variable and see if there's any output in the logs that indicates that media content couldn't be loaded.

Is there anything in the dataset annotations? The hashes are mostly relevant for detecting whether two examples are identical, and as a result, whether an example should be presented to you for annotation, or whether it's already annotated in the dataset and should be skipped. So if you already have examples with the same hashes in the dataset, new examples coming in with the same hashes would be skipped.
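
To make the skipping behaviour concrete, here is a minimal sketch in plain Python. It is not Prodigy's actual implementation; the function name filter_seen and the sample hashes are made up for illustration. It only shows the idea that an incoming example is dropped when its task hash is already known.

```python
def filter_seen(stream, seen_task_hashes):
    """Yield only examples whose _task_hash has not been seen before."""
    seen = set(seen_task_hashes)
    for eg in stream:
        task_hash = eg["_task_hash"]
        if task_hash in seen:
            continue  # treated as already annotated, so it is skipped
        seen.add(task_hash)
        yield eg


# Hypothetical hashes: 111 stands for an example already in the dataset
already_annotated = {111}
incoming = [
    {"audio": "a.mp3", "_task_hash": 111},
    {"audio": "b.mp3", "_task_hash": 222},
    {"audio": "c.mp3", "_task_hash": 222},
]
remaining = list(filter_seen(incoming, already_annotated))
print([eg["audio"] for eg in remaining])
```

Only b.mp3 survives here: a.mp3 collides with the existing dataset, and c.mp3 collides with b.mp3 inside the same stream.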

Yup, the audio widget is empty and there's nothing particular in the log, I think.

08:39:55: INIT: Setting all logging levels to 10
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
08:39:55: DB: Initializing database SQLite
08:39:55: DB: Connecting to database SQLite
08:39:55: DB: Creating dataset 'asr'
08:39:55: DB: Getting dataset 'asr'
08:39:55: DB: Added 1 examples to 1 datasets
08:39:55: DB: Loading dataset 'asr' (1 examples)
[{'audio:': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'nawalny.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'text', 'answer': 'accept'}]
08:39:55: RECIPE: Calling recipe 'audio.transcribe'
08:39:55: RECIPE: Starting recipe audio.transcribe
{'dataset': 'annotations', 'source': 'dataset:asr', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

08:39:55: LOADER: Loading stream from dataset asr (answer: all)
08:39:55: DB: Loading dataset 'asr' (1 examples)
08:39:55: LOADER: Rehashing stream
08:39:55: VALIDATE: Validating components returned by recipe
08:39:55: CONTROLLER: Initialising from recipe
{'before_db': <function remove_base64 at 0x7f89f57645e0>, 'config': {'blocks': [{'view_id': 'audio'}, {'view_id': 'text_input', 'field_rows': 4, 'field_label': 'Transcript', 'field_id': 'transcript', 'field_autofocus': True}], 'audio_autoplay': False, 'keymap': {'playpause': ['command+enter', 'option+enter', 'ctrl+enter']}, 'force_stream_order': True, 'dataset': 'annotations', 'recipe_name': 'audio.transcribe'}, 'dataset': 'annotations', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f89f577afa0>, 'self': <prodigy.core.Controller object at 0x7f89f571a730>, 'stream': <generator object at 0x7f89f572b040>, 'update': None, 'validate_answer': None, 'view_id': 'blocks'}

08:39:55: VALIDATE: Creating validator for view ID 'blocks'
08:39:55: VALIDATE: Validating Prodigy and recipe config

08:39:55: DB: Creating dataset '2021-03-10_08-39-55'
{'created': datetime.datetime(2020, 12, 2, 11, 25, 4)}

08:39:55: CONTROLLER: Initialising from recipe
{'batch_size': 10, 'dataset': 'annotations', 'db': None, 'exclude': 'task', 'filters': [{'name': 'RelatedSessionsFilter', 'cache_size': 10}], 'max_sessions': 10, 'overlap': True, 'self': <prodigy.components.feeds.RepeatingFeed object at 0x7f89f571aa30>, 'stream': <generator object at 0x7f89f572b040>, 'validator': <prodigy.components.validate.Validator object at 0x7f89f571a880>, 'view_id': 'blocks'}

08:39:55: CONTROLLER: Validating the first batch for session: None
08:39:55: PREPROCESS: Fetching media if necessary: ['audio', 'video']
{'input_keys': ['audio', 'video'], 'skip': False, 'stream': <generator object at 0x7f89f5779e50>}

08:39:55: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f89f5779dc0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f89ff253340>>, 'warn_threshold': 0.4}

08:39:55: CORS: initialized with wildcard "*" CORS origins

No, there shouldn't be anything - I'm initializing the empty dataset 'annotations'.

Thanks! And just to confirm, the same command works fine if you run it on the CLI and the audio data is displayed correctly?

Also, could you share the logging output that's shown once you open the app in the browser and load the first examples? It should start with something like POST: /get_session_questions etc.

Yes and no :smiley:
When I run
prodigy audio.transcribe first path/to/audio
save some transcriptions, and then run
prodigy audio.transcribe second dataset:first --fetch-media
it works (transcript and audio are loaded).

But when I want to do the same thing with a manually created JSONL file that I add to the database as shown in my first post, the audio widget stays empty, whether I call it from a Python script or from the shell.

So there must be a problem with the way I'm incorporating (the path to) the audio file in the manually created jsonl file which looks like this:

[{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

What I'm curious about is the path key, because in the jsonl files I looked at that were created by Prodigy, its value is only the name of the audio file, so just like the value of the audio key which doesn't seem logical to me. I tried replacing it by its absolute path which unfortunately didn't change anything. But I think we might be getting close to solving the problem now ^^.

14:08:45: POST: /get_session_questions
14:08:45: FEED: Finding next batch of questions in stream
14:08:45: RESPONSE: /get_session_questions (1 examples)
INFO:     127.0.0.1:53862 - "POST /get_session_questions HTTP/1.1" 200 OK

Edit: more log stuff
14:56:57: DB: Initializing database SQLite
14:56:57: DB: Connecting to database SQLite
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Added 1 examples to 1 datasets
14:56:57: RECIPE: Calling recipe 'audio.transcribe'
14:56:57: RECIPE: Starting recipe audio.transcribe
{'dataset': 'new_dataset', 'source': 'dataset:old_dataset', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

The "path" key is only really used to keep a copy of the original path when the value of "audio" etc. is replaced by the loaded base64-encoded data. Before the examples are placed in the database, Prodigy will then strip out the base64 data again and replace it with the path, so the data stored in the DB stays compact. (This is the default behaviour for audio recipes unless you set --keep-base64.)
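
As a rough sketch of that swap (a stand-in written for this thread, not the recipe's real before_db callback), the function below puts the stored path back wherever a media key holds inline data. The name strip_base64 and the assumption that loaded media is a data URI starting with "data:" are illustrative only.

```python
def strip_base64(examples, keys=("audio", "video", "image")):
    """Replace inline base64 data with the original path kept under "path"."""
    for eg in examples:
        for key in keys:
            value = eg.get(key)
            # Assumption: loaded media is stored as a data URI string
            if isinstance(value, str) and value.startswith("data:") and "path" in eg:
                eg[key] = eg["path"]
    return examples


annotated = [
    {"audio": "data:audio/mpeg;base64,AAAA", "path": "audio.mp3", "transcript": "hello"}
]
compact = strip_base64(annotated)
print(compact[0]["audio"])
```

This is why a stored example has the same file name under both "audio" and "path": the path is simply copied back before saving.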

If fetching the data works with the existing dataset and not your generated examples, there must be a subtle difference here :thinking: What does the resulting stream look like when you call the fetch_media preprocessor from Python, e.g. like this? Does it have base64 strings in it?

from prodigy.components.preprocess import fetch_media

stream = fetch_media(sample_examples)
print(list(stream))

from prodigy.components.preprocess import fetch_media
stream = [{"audio": "/absolute/path/to/file/audio.mp3"}]
stream = fetch_media(stream, "audio", skip=False)
print(list(stream))

This is what you mean, right? Because the only thing that is printed is the value of the key "audio", so "/absolute/path/to/file/audio.mp3"

Yes, exactly – but the second argument should be a list/tuple, so ["audio"]. (If not, I think it'll iterate over the letters as keys.)
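
The effect is plain Python behaviour and easy to check without Prodigy. In this small sketch, keys_to_check is a hypothetical helper that mimics a loop over the keys argument; it is not part of the library.

```python
def keys_to_check(input_keys):
    """Return the keys a naive loop over input_keys would visit."""
    return [key for key in input_keys]


# A bare string is iterated character by character
print(keys_to_check("audio"))
# A list yields the key itself
print(keys_to_check(["audio"]))
```

With the string, the loop looks for keys named "a", "u", "d", "i" and "o", none of which exist in the example, so nothing gets loaded.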

Oh, ok I just used a string because it was done like that here https://prodi.gy/docs/api-loaders.

from prodigy.components.preprocess import fetch_media

stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_media(stream, "image", skip=True)

But anyway, using a list, it returns a long base64 encoded string like this tupbGJU5bzS8vgEESABCDfIapUYrww4yMbV0vtzg/NA6zhxFcsMbG2Icuk6xL8CMzuafZF3YxCZQn5sK12/3rOc/VqYfR7P3JWXclbSd7bbU9b2+jLXXrFf69txtQp/qkKPrF86tiJ6fWZ7a+ocCmd7zr41W9Pq/xT5x/67zreNa3akWbFs+/v4FoGK
(And when decoded, it's nonsense characters.)

Hmm, this all looks correct to me – the fact that it encodes the data means that the file paths are found and are correct. (Don't worry about the decoded data not making sense because that's literally the binary audio data.)

I'll experiment with this some more but it's definitely pretty mysterious that it works if you run it one way and not the other :thinking: If you have a reproducible example that I can run, that would definitely be helpful.

I solved it! It was simply a stupid typo - I had "audio:" as a key in the dictionary instead of "audio".
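
For anyone building examples by hand, a small sanity check can catch this kind of typo before the data reaches the database. This is only a sketch: find_suspicious_keys and the EXPECTED_KEYS set are assumptions made up for illustration, so adjust the expected keys to your own task format.

```python
# Assumption: the keys this particular interface needs
EXPECTED_KEYS = {"audio", "transcript"}


def find_suspicious_keys(example, expected=EXPECTED_KEYS):
    """Return (missing, lookalikes) for one example dict.

    missing: expected keys that are absent.
    lookalikes: present keys that only match an expected key after
    stripping surrounding whitespace and punctuation.
    """
    missing = sorted(expected - set(example))
    lookalikes = sorted(
        key for key in example
        if key not in expected and key.strip(" :;,.") in expected
    )
    return missing, lookalikes


example = {"audio:": "audio.mp3", "transcript": "text"}
missing, lookalikes = find_suspicious_keys(example)
print(missing, lookalikes)
```

Here "audio" is reported as missing and "audio:" as its lookalike, which points straight at the stray colon.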

Omg, so glad you got it working and sorry to hear! I can definitely relate, it's always some typo :sweat_smile:

If you don't mind me asking, how did you load in a transcription to correct it using Prodigy (with the associated audio files)?

prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription

How did you read in your ASR generated transcriptions to old_transcription dataset?

I have my transcription as running text saved in asr_transcription.txt.
I create a dictionary (one JSONL entry) containing the transcription as a string alongside the required keys 'audio', 'text' etc., as shown below (the hash numbers are practically random). I then add this dictionary to the database as described here: https://prodi.gy/docs/api-database.
Finally, I call prodigy audio.transcribe with a new dataset name, dataset: followed by the name of the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the JSONL dictionary.

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

# filename, dataset and reviewed_dataset are assumed to be defined earlier in the script
with open("asr_transcription.txt", encoding="utf-8") as asr_read:
    text = ""
    for line in asr_read:
        # keep only lines that start with a letter
        if line[0].isalpha():
            text += line.strip() + "\n"

jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename, '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': text, 'answer': 'accept'}

# save the ASR transcription to the Prodigy database
db = connect()
# hash the examples so they all have a unique task hash and input hash:
# Prodigy uses these to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]

# create new dataset
db.add_dataset(dataset)
# add examples to the (existing!) dataset
db.add_examples(jsonl_annotations, datasets=[dataset])

# open Prodigy to review the automatic transcription (--fetch-media loads the audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))

Hope this helps!

Thanks for sharing! :+1: One quick comment on that in case others come across this thread later: it's not really necessary to add the data to a dataset – instead, you can also load it directly from a .jsonl file and set --loader jsonl.
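
A minimal sketch of writing such a file with the standard library, one JSON object per line. The helper name write_jsonl, the output file name and the example content are placeholders for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, examples):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")


# Placeholder example: the audio path and transcript are made up
examples = [
    {"audio": "audio.mp3", "transcript": "transcript from asr."},
]
out_path = Path(tempfile.gettempdir()) / "asr_transcripts.jsonl"
write_jsonl(out_path, examples)

lines = out_path.read_text(encoding="utf-8").splitlines()
print(len(lines))
```

The resulting file can then be passed to audio.transcribe as the source, together with --loader jsonl and --fetch-media.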

Thank you, this was very helpful.

One problem I had with this was when I tried to add multiple examples, prodigy would only show one of them and drop others as duplicates.

I call set_hashes() like this:

examples = [set_hashes(example, overwrite = True, input_keys = ("audio"), task_keys = ("audio")) for example in first_pass_transcripts]

I print out task and input hashes for all examples:

DEBUG:root: example TASK hashes:
DEBUG:root:[2025448499, -677278922, 144250741, -581610436]
DEBUG:root: example INPUT hashes:
DEBUG:root:[819208844, 1154496076, -2057892666, 61420621]

I call the python script that has the same prodigy.serve call with audio.transcribe

...
INFO:prodigy:DB: Creating dataset 'asr_trial'
...
INFO:prodigy:DB: Getting dataset 'asr_trial'
...
INFO:prodigy:DB: Added 4 examples to 1 datasets
...
INFO:prodigy:RECIPE: Calling recipe 'audio.transcribe'
INFO:prodigy:RECIPE: Starting recipe audio.transcribe
INFO:prodigy:LOADER: Loading stream from dataset asr_trial (answer: all)
...
INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
...
INFO:prodigy:LOADER: Rehashing stream
INFO:prodigy:VALIDATE: Validating components returned by recipe
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:VALIDATE: Creating validator for view ID 'blocks'
INFO:prodigy:VALIDATE: Validating Prodigy and recipe config
...
INFO:prodigy:DB: Creating dataset 'asr_trial_corrected'
...
INFO:prodigy:DB: Creating dataset '2021-04-16_17-23-10'
...
INFO:prodigy:CONTROLLER: Initialising from recipe
INFO:prodigy:CONTROLLER: Validating the first batch for session: None
INFO:prodigy:PREPROCESS: Fetching media if necessary: ['audio', 'video']
INFO:prodigy:FILTER: Filtering duplicates from stream
INFO:prodigy:CORS: initialized with wildcard "*" CORS origins

✨  Starting the web server at http://localhost:8080 ...

When I open the UI, it shows the first example with the audio and transcription, but once I accept/reject/ignore it, it shows No tasks available.

INFO:prodigy:POST: /get_session_questions
INFO:prodigy:FEED: Finding next batch of questions in stream
INFO:prodigy:FEED: skipped: -1607572746
INFO:prodigy:RESPONSE: /get_session_questions (0 examples)

@ines

  1. What might be the reason that it's skipping other examples?
  2. Why is the hash that is skipped different than any hash I have in my input?

I also tried it via the command line and got the same issue with duplicates. After writing all examples to a .jsonl file, I tried to load it via the CLI:

prodigy audio.transcribe asr_trial_corrected ./data/transcriptions/prodigy_input_aws_transcribe.jsonl --fetch-media --loader jsonl

Still shows No tasks available after the first example.

17:55:11: FEED: Finding next batch of questions in stream
⚠ Warning: filtered 75% of entries because they were duplicates. Only 1 items
were shown out of 4. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.
17:55:11: RESPONSE: /get_session_questions (1 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
17:55:13: POST: /get_session_questions
17:55:13: FEED: Finding next batch of questions in stream
17:55:13: FEED: skipped: 21806991
17:55:13: RESPONSE: /get_session_questions (0 examples)
INFO:     ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK

It looks like the main problem here is that all examples except for one are filtered out because they're considered duplicates, most likely because they ended up with the same hashes.

One potential problem is in the set_hashes call, where you pass input_keys = ("audio") and task_keys = ("audio"):

The keys should be iterables of strings, but ("audio") ends up being "audio". So you'd either want this to be ("audio",) or ["audio"].
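
This is a general Python gotcha rather than anything specific to Prodigy, and it can be verified in isolation. The variable names below are only for illustration.

```python
not_a_tuple = ("audio")   # parentheses only group: this is the string "audio"
one_tuple = ("audio",)    # the trailing comma makes a one-element tuple
as_list = ["audio"]       # a list works just as well

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)
print(type(as_list).__name__)
```

It is the comma that creates the tuple, not the parentheses, so ("audio") and "audio" are the same object type.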

EDIT: Solved. Details in EDIT section below.

Thanks for the prompt answer but even with ("audio",) or ["audio"] I am having the same issue. I don't think the set_hashes() mapping to the same hash is the issue since after calling the function, I log both task and input hashes and they all seem to be different:

DEBUG:root:Inputing example TASK hashes:
DEBUG:root:[-1464920486, 1203448608, -244439580, 1932216617]
DEBUG:root:Inputing example INPUT hashes:
DEBUG:root:[158202857, 1296278939, 318945175, 549324070]

What might be another reason for this? Should I also delete the "Session" datasets before rerunning the script (I run prodigy drop input_dataset and corrected_dataset after each run of the script)?

One thing I noticed was the script is rehashing the stream:

INFO:prodigy:DB: Loading dataset 'asr_trial' (4 examples)
INFO:prodigy:LOADER: Rehashing stream

Maybe this is causing the hashes to be the same, since I have no control over the parameters of this hashing?

EDIT: For each example in the stream, the text field was ''. When I passed the audio path as the text field instead, there were no duplicates and all 4 examples showed up in the UI. This supports the hypothesis that Prodigy re-hashes the examples loaded from the dataset by their text fields. Is there any way to circumvent this?
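
The effect can be reproduced with a toy hash function. This is only a sketch of the hypothesis: toy_hash is a made-up stand-in and not how Prodigy computes its hashes, but it shows why hashing on an empty text field collapses distinct audio examples into one.

```python
import hashlib


def toy_hash(example, keys):
    """Hash only the values of the given keys (a stand-in, not Prodigy's hashing)."""
    payload = "|".join(f"{key}={example.get(key, '')}" for key in keys)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


examples = [
    {"audio": "a.mp3", "text": ""},
    {"audio": "b.mp3", "text": ""},
]

# Hashing on the empty text field: both examples look identical
by_text = {toy_hash(eg, ["text"]) for eg in examples}
# Hashing on the audio path: the examples stay distinct
by_audio = {toy_hash(eg, ["audio"]) for eg in examples}

# The workaround from the post: copy the audio path into the text field
patched = [dict(eg, text=eg["audio"]) for eg in examples]
by_patched_text = {toy_hash(eg, ["text"]) for eg in patched}

print(len(by_text), len(by_audio), len(by_patched_text))
```

Under this toy model, the text-only hash yields a single value for both examples, while hashing on the audio path, or on a text field that contains the audio path, keeps them apart.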

Ah, thanks for the analysis, it looks like the recipe is indeed rehashing the stream :thinking: I can't think of a good reason why this is done in this particular recipe, so we should remove that. The easiest workaround is to just remove the rehash=True in recipes/audio.py (you can run prodigy stats to find the location of your Prodigy installation).

Edit: Fixed in v1.11!