Correct Audio Transcription


I generated an audio transcription using ASR (and wrote it into a JSONL file) which I would like to correct manually using Prodigy. I understand how to load in a transcription to correct it using the dataset: syntax, but I can't load in the audio file that I want to transcribe.

I tried:
prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription
and I got:
prodigy audio.transcribe: error: unrecognized arguments: dataset:old_transcription

When running without the ./recordings part, it works perfectly fine (and the dataset old_transcription certainly does exist), but obviously I don't have the audio loaded in, which would be useful for transcribing.

Is there a simple way to do that?

Hi! I think what you're looking for is the --fetch-media flag: this will load back the media (e.g. audio files) from the paths in the JSON when you load your previous annotations back in. Just make sure the data is available via the same path.

yes!:partying_face: thank you so much for the incredibly quick reply!


Ok, I actually have one follow up question: it should work with prodigy.serve() integrated in a python script as well, shouldn't it?
So what I'm doing is

prodigy.serve(" prodigy audio.transcribe new dataset:old --fetch-media")

but it doesn't load the old transcription nor the audio file even though it is working as a command in the terminal. Am I missing something?

prodigy.serve should resolve to exactly the same process as running the command from the command line. Are you running the script from the same working directory? If your paths are relative paths to a local directory, the execution context of the script could mean that the paths aren't found.
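One quick way to sanity-check this is to resolve the path from inside the script itself, so you can see what the relative path actually points at (a small stdlib sketch; the file name here is just an example):

```python
from pathlib import Path

# Hypothetical relative path, as it would appear in your task data
audio = Path("audio.mp3")

print("Working directory:", Path.cwd())
print("Resolved path:", audio.resolve())
print("File exists:", audio.exists())
```

If "File exists" prints False, the working directory the script runs from isn't the one the relative paths were written for.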

Right, so I got that working now, but what I'm actually trying to do is load a transcription JSONL that was not made using Prodigy, and I think that might be causing the problem of the audio file not loading.
I looked at the structure of prodigy created jsonl files and created one like this:

transcript = [{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

The audio file is in the same directory as my python script and I did try and use its absolute path, which had no effect at all.

I'm pretty sure I got the structure and all the keys right (haven't I?), and I'm guessing the problem might come from _input_hash and _task_hash. I just used hashes from a real Prodigy output file and changed some digits, but I also use set_hashes, so I don't know if that causes some kind of error?

I do:

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
transcript = [set_hashes(eg) for eg in transcript]
db.add_examples(transcript, datasets=["asr"])
prodigy.serve("audio.transcribe annotations dataset:asr --fetch-media")

Thank you so much again for your quick and helpful answers, I appreciate it a lot!

Hi! When you say the audio isn't loading, do you mean that the examples show up but the audio widget / waveform is empty? Or do the examples not get sent out and you see "No tasks available"?

If the waveform is empty, try setting the PRODIGY_LOGGING=basic environment variable and see if there's any output in the logs that indicates that media content couldn't be loaded.

Is there anything in the dataset annotations? The hashes are mostly relevant for detecting whether two examples are identical, and as a result, whether an example should be presented to you for annotation, or whether it's already annotated in the dataset and should be skipped. So if you already have examples with the same hashes in the dataset, new examples coming in with the same hashes would be skipped.
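The deduplication logic is roughly like this (a simplified stdlib sketch of the idea, not Prodigy's actual internals – Prodigy computes its hashes from configurable task keys):

```python
import hashlib
import json

def task_hash(task, keys=("text", "transcript", "audio")):
    # Hash only the task-relevant keys, so identical tasks collide
    payload = json.dumps({k: task.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def filter_seen(stream, seen_hashes):
    # Skip any incoming example whose hash is already in the dataset
    for task in stream:
        h = task_hash(task)
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield task
```

So if the hashes of your incoming examples match hashes already stored in the dataset, those examples are treated as already annotated and never sent out.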

Yup, the audio widget is empty and there's nothing particular in the log, I think.

08:39:55: INIT: Setting all logging levels to 10
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
08:39:55: DB: Initializing database SQLite
08:39:55: DB: Connecting to database SQLite
08:39:55: DB: Creating dataset 'asr'
08:39:55: DB: Getting dataset 'asr'
08:39:55: DB: Added 1 examples to 1 datasets
08:39:55: DB: Loading dataset 'asr' (1 examples)
[{'audio:': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'nawalny.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'text', 'answer': 'accept'}]
08:39:55: RECIPE: Calling recipe 'audio.transcribe'
08:39:55: RECIPE: Starting recipe audio.transcribe
{'dataset': 'annotations', 'source': 'dataset:asr', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

08:39:55: LOADER: Loading stream from dataset asr (answer: all)
08:39:55: DB: Loading dataset 'asr' (1 examples)
08:39:55: LOADER: Rehashing stream
08:39:55: VALIDATE: Validating components returned by recipe
08:39:55: CONTROLLER: Initialising from recipe
{'before_db': <function remove_base64 at 0x7f89f57645e0>, 'config': {'blocks': [{'view_id': 'audio'}, {'view_id': 'text_input', 'field_rows': 4, 'field_label': 'Transcript', 'field_id': 'transcript', 'field_autofocus': True}], 'audio_autoplay': False, 'keymap': {'playpause': ['command+enter', 'option+enter', 'ctrl+enter']}, 'force_stream_order': True, 'dataset': 'annotations', 'recipe_name': 'audio.transcribe'}, 'dataset': 'annotations', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f89f577afa0>, 'self': <prodigy.core.Controller object at 0x7f89f571a730>, 'stream': <generator object at 0x7f89f572b040>, 'update': None, 'validate_answer': None, 'view_id': 'blocks'}

08:39:55: VALIDATE: Creating validator for view ID 'blocks'
08:39:55: VALIDATE: Validating Prodigy and recipe config

08:39:55: DB: Creating dataset '2021-03-10_08-39-55'
{'created': datetime.datetime(2020, 12, 2, 11, 25, 4)}

08:39:55: CONTROLLER: Initialising from recipe
{'batch_size': 10, 'dataset': 'annotations', 'db': None, 'exclude': 'task', 'filters': [{'name': 'RelatedSessionsFilter', 'cache_size': 10}], 'max_sessions': 10, 'overlap': True, 'self': <prodigy.components.feeds.RepeatingFeed object at 0x7f89f571aa30>, 'stream': <generator object at 0x7f89f572b040>, 'validator': <prodigy.components.validate.Validator object at 0x7f89f571a880>, 'view_id': 'blocks'}

08:39:55: CONTROLLER: Validating the first batch for session: None
08:39:55: PREPROCESS: Fetching media if necessary: ['audio', 'video']
{'input_keys': ['audio', 'video'], 'skip': False, 'stream': <generator object at 0x7f89f5779e50>}

08:39:55: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f89f5779dc0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f89ff253340>>, 'warn_threshold': 0.4}

08:39:55: CORS: initialized with wildcard "*" CORS origins

No, there shouldn't be anything - I'm initializing the empty dataset 'annotations'.

Thanks! And just to confirm, the same command works fine if you run it on the CLI and the audio data is displayed correctly?

Also, could you share the logging output that's shown once you open the app in the browser and load the first examples? It should start with something like POST: /get_session_questions etc.

Yes and no :smiley:
When I do
prodigy audio.transcribe first path/to/audio
save a transcription, and then
prodigy audio.transcribe second dataset:first --fetch-media
it works (transcript and audio are loaded).

But when I want to do the same thing with a manually created jsonl that I add to the database as shown here

the audio widget stays empty whether I'm calling it from a python script or from the shell.

So there must be a problem with the way I'm incorporating (the path to) the audio file in the manually created jsonl file which looks like this:

[{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

What I'm curious about is the "path" key: in the Prodigy-created JSONL files I looked at, its value is just the name of the audio file, the same as the value of the "audio" key, which doesn't seem logical to me. I tried replacing it with the absolute path, which unfortunately didn't change anything. But I think we might be getting close to solving the problem now ^^.

14:08:45: POST: /get_session_questions
14:08:45: FEED: Finding next batch of questions in stream
14:08:45: RESPONSE: /get_session_questions (1 examples)
INFO: - "POST /get_session_questions HTTP/1.1" 200 OK

Edit: more log stuff
14:56:57: DB: Initializing database SQLite
14:56:57: DB: Connecting to database SQLite
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Added 1 examples to 1 datasets
14:56:57: RECIPE: Calling recipe 'audio.transcribe'
14:56:57: RECIPE: Starting recipe audio.transcribe
{'dataset': 'new_dataset', 'source': 'dataset:old_dataset', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

The "path" key is only really used to keep a copy of the original path when the value of "audio" etc. is replaced by the loaded base64-encoded data. Before the examples are placed in the database, Prodigy will then strip out the base64 data again and replace it with the path, so the data stored in the DB stays compact. (This is the default behaviour for audio recipes unless you set --keep-base64.)
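Conceptually, the round trip looks something like this (a simplified sketch of the idea, not Prodigy's actual implementation – function names here are just illustrative):

```python
import base64

def load_audio(task):
    # Replace the "audio" path with the base64-encoded file contents and
    # keep the original path under "path" – roughly what fetching media does
    with open(task["audio"], "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {**task, "path": task["audio"], "audio": "data:audio/mp3;base64," + data}

def strip_base64(task):
    # Before the example is saved, swap the base64 data back for the path
    # so the stored example stays compact
    if task["audio"].startswith("data:") and task.get("path"):
        return {**task, "audio": task["path"]}
    return task
```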

If fetching the data works with the existing dataset and not your generated examples, there must be a subtle difference here :thinking: What does the resulting stream look like when you call the fetch_media preprocessor from Python, e.g. like this? Does it have base64 strings in it?

from prodigy.components.preprocess import fetch_media

stream = fetch_media(sample_examples, ["audio"])

from prodigy.components.preprocess import fetch_media
stream = [{"audio": "/absolute/path/to/file/audio.mp3"}]
stream = fetch_media(stream, "audio", skip=False)

This is what you mean, right? Because the only thing that is printed is the value of the key "audio", so "/absolute/path/to/file/audio.mp3"

Yes, exactly – but the second argument should be a list/tuple, so ["audio"]. (If not, I think it'll iterate over the letters as keys.)
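This is the usual Python gotcha: a string is itself iterable, so iterating over "audio" yields its individual characters rather than a single key:

```python
# Iterating over a string yields its characters, not the string itself
print([key for key in "audio"])    # ['a', 'u', 'd', 'i', 'o']
print([key for key in ["audio"]])  # ['audio']
```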

Oh, ok I just used a string because it was done like that here

from prodigy.components.preprocess import fetch_media

stream = [{"image": "/path/to/image.jpg"}, {"image": ""}]
stream = fetch_media(stream, "image", skip=True)

But anyway, using a list, it returns a long base64-encoded string like this:

tupbGJU5bzS8vgEESABCDfIapUYrww4yMbV0vtzg/NA6zhxFcsMbG2Icuk6xL8CMzuafZF3YxCZQn5sK12/3rOc/VqYfR7P3JWXclbSd7bbU9b2+jLXXrFf69txtQp/qkKPrF86tiJ6fWZ7a+ocCmd7zr41W9Pq/xT5x/67zreNa3akWbFs+/v4FoGK

(And when decoded, it's nonsense characters.)

Hmm, this all looks correct to me – the fact that it encodes the data means that the file paths are found and are correct. (Don't worry about the decoded data not making sense because that's literally the binary audio data.)

I'll experiment with this some more but it's definitely pretty mysterious that it works if you run it one way and not the other :thinking: If you have a reproducible example that I can run, that would definitely be helpful.

I solved it! It was simply a stupid typo - I had "audio:" as a key in the dictionary instead of "audio".


Omg, so glad you got it working and sorry to hear! I can definitely relate, it's always some typo :sweat_smile:

If you don't mind me asking, how did you load in a transcription to correct it using Prodigy (with the associated audio files)?

prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription

How did you read in your ASR generated transcriptions to old_transcription dataset?

I have my transcription as running text saved in asr_transcription.txt.
I create a dictionary (for the JSONL) where I put the transcription as a string alongside the required keys 'audio', 'text' etc., as shown below (the hash numbers are practically random). I then add this dictionary to the database as shown here:
At the end, I call prodigy audio.transcribe with a new dataset name, dataset: plus the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the JSONL dictionary.

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

with open("asr_transcription.txt", encoding="utf-8") as asr_read:
    text = ""
    for line in asr_read.readlines():
        if line[0].isalpha():
            text += line.strip() + "\n"

jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename, '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], "transcript": text, "answer": "accept"}

# save ASR transcription to the Prodigy database
db = connect()
# hash the examples to make sure they all have a unique task hash and input hash –
# this is used by Prodigy to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]

# add examples to the dataset (created if it doesn't exist yet)
db.add_examples(jsonl_annotations, datasets=[dataset])

# open Prodigy to review the automatic transcription (--fetch-media loads in the audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))

Hope this helps!


Thanks for sharing! :+1: One quick comment on that in case others come across this thread later: it's not really necessary to add the data to a dataset – instead, you can also load it directly from a .jsonl file and set --loader jsonl.