Correct Audio Transcription


I generated an audio transcription using ASR (and wrote it into a JSONL file) which I would like to correct manually using Prodigy. I understand how to load in a transcription to correct it using the dataset: syntax, but I can't load in the audio file that I want to transcribe.

I tried:
prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription
and I got:
prodigy audio.transcribe: error: unrecognized arguments: dataset:old_transcription

When running without the ./recordings part, it works perfectly fine (and the dataset old_transcription certainly does exist), but obviously I don't have the audio loaded in, which would be useful for transcribing.

Is there a simple way to do that?

Hi! I think what you're looking for is the --fetch-media flag: this will load back the media (e.g. audio files) from the paths in the JSON when you load your previous annotations back in. Just make sure the data is available via the same path.

yes!:partying_face: thank you so much for the incredibly quick reply!


Ok, I actually have one follow up question: it should work with prodigy.serve() integrated in a python script as well, shouldn't it?
So what I'm doing is

prodigy.serve(" prodigy audio.transcribe new dataset:old --fetch-media")

but it doesn't load the old transcription nor the audio file even though it is working as a command in the terminal. Am I missing something?

prodigy.serve should resolve to exactly the same process as running the command from the command line. Are you running the script from the same working directory? If your paths are relative paths to a local directory, the execution context of the script could mean that the paths aren't found.
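One quick way to sanity-check this is to resolve the path from inside the script itself, so you can see what the relative path actually points at (a small stdlib sketch; the file name here is just an example):

```python
from pathlib import Path

# Hypothetical relative path, as it would appear in your task data
audio = Path("audio.mp3")

print("Working directory:", Path.cwd())
print("Resolved path:", audio.resolve())
print("File exists:", audio.exists())
```

If "File exists" prints False, the working directory the script runs from isn't the one the relative paths were written for.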

Right, so I got that working now, but what I'm actually trying to do is load a transcription JSONL that was not made using Prodigy, and I think that might be causing the problem of the audio file not loading.
I looked at the structure of prodigy created jsonl files and created one like this:

transcript = [{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

The audio file is in the same directory as my python script and I did try and use its absolute path, which had no effect at all.

I'm pretty sure I got the structure and all the keys right (haven't I?), and I'm guessing the problem might come from _input_hash and _task_hash. I just used hashes from a real Prodigy output file and changed some digits, but I also use set_hashes, so I don't know if that causes some kind of error?

I do:

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
transcript = [set_hashes(eg) for eg in transcript]
db.add_examples(transcript, datasets=["asr"])
prodigy.serve("audio.transcribe annotations dataset:asr --fetch-media")

Thank you so much again for your quick and helpful answers, I appreciate it a lot!

Hi! When you say the audio isn't loading, do you mean that the examples show up but the audio widget / waveform is empty? Or do the examples not get sent out and you see "No tasks available"?

If the waveform is empty, try setting the PRODIGY_LOGGING=basic environment variable and see if there's any output in the logs that indicates that media content couldn't be loaded.

Is there anything in the dataset annotations? The hashes are mostly relevant for detecting whether two examples are identical, and as a result, whether an example should be presented to you for annotation, or whether it's already annotated in the dataset and should be skipped. So if you already have examples with the same hashes in the dataset, new examples coming in with the same hashes would be skipped.
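The deduplication logic is roughly like this (a simplified stdlib sketch of the idea, not Prodigy's actual internals – Prodigy computes its hashes from configurable task keys):

```python
import hashlib
import json

def task_hash(task, keys=("text", "transcript", "audio")):
    # Hash only the task-relevant keys, so identical tasks collide
    payload = json.dumps({k: task.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def filter_seen(stream, seen_hashes):
    # Skip any incoming example whose hash is already in the dataset
    for task in stream:
        h = task_hash(task)
        if h not in seen_hashes:
            seen_hashes.add(h)
            yield task
```

So if the hashes of your incoming examples match hashes already stored in the dataset, those examples are treated as already annotated and never sent out.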

Yup, the audio widget is empty and there's nothing particular in the log, I think.

08:39:55: INIT: Setting all logging levels to 10
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
08:39:55: DB: Initializing database SQLite
08:39:55: DB: Connecting to database SQLite
08:39:55: DB: Creating dataset 'asr'
08:39:55: DB: Getting dataset 'asr'
08:39:55: DB: Added 1 examples to 1 datasets
08:39:55: DB: Loading dataset 'asr' (1 examples)
[{'audio:': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'nawalny.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'text', 'answer': 'accept'}]
08:39:55: RECIPE: Calling recipe 'audio.transcribe'
08:39:55: RECIPE: Starting recipe audio.transcribe
{'dataset': 'annotations', 'source': 'dataset:asr', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

08:39:55: LOADER: Loading stream from dataset asr (answer: all)
08:39:55: DB: Loading dataset 'asr' (1 examples)
08:39:55: LOADER: Rehashing stream
08:39:55: VALIDATE: Validating components returned by recipe
08:39:55: CONTROLLER: Initialising from recipe
{'before_db': <function remove_base64 at 0x7f89f57645e0>, 'config': {'blocks': [{'view_id': 'audio'}, {'view_id': 'text_input', 'field_rows': 4, 'field_label': 'Transcript', 'field_id': 'transcript', 'field_autofocus': True}], 'audio_autoplay': False, 'keymap': {'playpause': ['command+enter', 'option+enter', 'ctrl+enter']}, 'force_stream_order': True, 'dataset': 'annotations', 'recipe_name': 'audio.transcribe'}, 'dataset': 'annotations', 'db': True, 'exclude': None, 'get_session_id': None, 'on_exit': None, 'on_load': None, 'progress': <prodigy.components.progress.ProgressEstimator object at 0x7f89f577afa0>, 'self': <prodigy.core.Controller object at 0x7f89f571a730>, 'stream': <generator object at 0x7f89f572b040>, 'update': None, 'validate_answer': None, 'view_id': 'blocks'}

08:39:55: VALIDATE: Creating validator for view ID 'blocks'
08:39:55: VALIDATE: Validating Prodigy and recipe config

08:39:55: DB: Creating dataset '2021-03-10_08-39-55'
{'created': datetime.datetime(2020, 12, 2, 11, 25, 4)}

08:39:55: CONTROLLER: Initialising from recipe
{'batch_size': 10, 'dataset': 'annotations', 'db': None, 'exclude': 'task', 'filters': [{'name': 'RelatedSessionsFilter', 'cache_size': 10}], 'max_sessions': 10, 'overlap': True, 'self': <prodigy.components.feeds.RepeatingFeed object at 0x7f89f571aa30>, 'stream': <generator object at 0x7f89f572b040>, 'validator': <prodigy.components.validate.Validator object at 0x7f89f571a880>, 'view_id': 'blocks'}

08:39:55: CONTROLLER: Validating the first batch for session: None
08:39:55: PREPROCESS: Fetching media if necessary: ['audio', 'video']
{'input_keys': ['audio', 'video'], 'skip': False, 'stream': <generator object at 0x7f89f5779e50>}

08:39:55: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f89f5779dc0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f89ff253340>>, 'warn_threshold': 0.4}

08:39:55: CORS: initialized with wildcard "*" CORS origins

No, there shouldn't be anything - I'm initializing the empty dataset 'annotations'.

Thanks! And just to confirm, the same command works fine if you run it on the CLI and the audio data is displayed correctly?

Also, could you share the logging output that's shown once you open the app in the browser and load the first examples? It should start with something like POST: /get_session_questions etc.

Yes and no :smiley:
When I do
prodigy audio.transcribe first path/to/audio
save a transcription, and then
prodigy audio.transcribe second dataset:first --fetch-media
it works (transcript and audio are loaded).

But when I want to do the same thing with a manually created jsonl that I add to the database as shown here

the audio widget stays empty whether I'm calling it from a python script or from the shell.

So there must be a problem with the way I'm incorporating (the path to) the audio file in the manually created jsonl file which looks like this:

[{'audio': 'audio.mp3', 'text': '', 'meta': {'file': 'audio.mp3'}, 'path': 'audio.mp3', '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], 'transcript': 'transcript from asr.', 'answer': 'accept'}]

What I'm curious about is the "path" key: in the Prodigy-created JSONL files I looked at, its value is just the name of the audio file, the same as the value of the "audio" key, which doesn't seem logical to me. I tried replacing it with the absolute path, which unfortunately didn't change anything. But I think we might be getting close to solving the problem now ^^.

14:08:45: POST: /get_session_questions
14:08:45: FEED: Finding next batch of questions in stream
14:08:45: RESPONSE: /get_session_questions (1 examples)
INFO: - "POST /get_session_questions HTTP/1.1" 200 OK

Edit: more log stuff
14:56:57: DB: Initializing database SQLite
14:56:57: DB: Connecting to database SQLite
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Getting dataset 'old_dataset'
14:56:57: DB: Added 1 examples to 1 datasets
14:56:57: RECIPE: Calling recipe 'audio.transcribe'
14:56:57: RECIPE: Starting recipe audio.transcribe
{'dataset': 'new_dataset', 'source': 'dataset:old_dataset', 'loader': 'audio', 'playpause_key': ['command+enter', 'option+enter', 'ctrl+enter'], 'text_rows': 4, 'field_id': 'transcript', 'autoplay': False, 'keep_base64': False, 'fetch_media': True, 'exclude': None}

The "path" key is only really used to keep a copy of the original path when the value of "audio" etc. is replaced by the loaded base64-encoded data. Before the examples are placed in the database, Prodigy will then strip out the base64 data again and replace it with the path, so the data stored in the DB stays compact. (This is the default behaviour for audio recipes unless you set --keep-base64.)
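Conceptually, the round trip looks something like this (a simplified sketch of the idea, not Prodigy's actual implementation – function names here are just illustrative):

```python
import base64

def load_audio(task):
    # Replace the "audio" path with the base64-encoded file contents and
    # keep the original path under "path" – roughly what fetching media does
    with open(task["audio"], "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {**task, "path": task["audio"], "audio": "data:audio/mp3;base64," + data}

def strip_base64(task):
    # Before the example is saved, swap the base64 data back for the path
    # so the stored example stays compact
    if task["audio"].startswith("data:") and task.get("path"):
        return {**task, "audio": task["path"]}
    return task
```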

If fetching the data works with the existing dataset and not your generated examples, there must be a subtle difference here :thinking: What does the resulting stream look like when you call the fetch_media preprocessor from Python, e.g. like this? Does it have base64 strings in it?

from prodigy.components.preprocess import fetch_media

stream = fetch_media(sample_examples, ["audio"])

from prodigy.components.preprocess import fetch_media
stream = [{"audio": "/absolute/path/to/file/audio.mp3"}]
stream = fetch_media(stream, "audio", skip=False)

This is what you mean, right? Because the only thing that is printed is the value of the key "audio", so "/absolute/path/to/file/audio.mp3"

Yes, exactly – but the second argument should be a list/tuple, so ["audio"]. (If not, I think it'll iterate over the letters as keys.)
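This is the usual Python gotcha: a string is itself iterable, so iterating over "audio" yields its individual characters rather than a single key:

```python
# Iterating over a string yields its characters, not the string itself
print([key for key in "audio"])    # ['a', 'u', 'd', 'i', 'o']
print([key for key in ["audio"]])  # ['audio']
```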

Oh, ok I just used a string because it was done like that here

from prodigy.components.preprocess import fetch_media

stream = [{"image": "/path/to/image.jpg"}, {"image": ""}]
stream = fetch_media(stream, "image", skip=True)

But anyway, using a list, it returns a long base64-encoded string like this:

tupbGJU5bzS8vgEESABCDfIapUYrww4yMbV0vtzg/NA6zhxFcsMbG2Icuk6xL8CMzuafZF3YxCZQn5sK12/3rOc/VqYfR7P3JWXclbSd7bbU9b2+jLXXrFf69txtQp/qkKPrF86tiJ6fWZ7a+ocCmd7zr41W9Pq/xT5x/67zreNa3akWbFs+/v4FoGK

(And when decoded, it's nonsense characters.)

Hmm, this all looks correct to me – the fact that it encodes the data means that the file paths are found and are correct. (Don't worry about the decoded data not making sense because that's literally the binary audio data.)

I'll experiment with this some more but it's definitely pretty mysterious that it works if you run it one way and not the other :thinking: If you have a reproducible example that I can run, that would definitely be helpful.

I solved it! It was simply a stupid typo - I had "audio:" as a key in the dictionary instead of "audio".


Omg, so glad you got it working and sorry to hear! I can definitely relate, it's always some typo :sweat_smile:

If you don't mind me asking, how did you load in a transcription to correct it using Prodigy (with the associated audio files)?

prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription

How did you read in your ASR generated transcriptions to old_transcription dataset?

I have my transcription as running text saved in asr_transcription.txt.
I create a dictionary (for the JSONL) where I put the transcription as a string alongside the required keys 'audio', 'text' etc., as shown below (the hash numbers are practically random). I then add this dictionary to the database as shown here:
At the end, I call prodigy audio.transcribe with a new dataset name, dataset: plus the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the JSONL dictionary.

import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

with open("asr_transcription.txt", encoding="utf-8") as asr_read:
    text = ""
    for line in asr_read.readlines():
        if line[0].isalpha():
            text += line.strip() + "\n"

jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename, '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], "transcript": text, "answer": "accept"}

# save ASR transcription to the Prodigy database
db = connect()
# hash the examples to make sure they all have a unique task hash and input hash –
# this is used by Prodigy to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]

# add examples to the dataset (created if it doesn't exist yet)
db.add_examples(jsonl_annotations, datasets=[dataset])

# open Prodigy to review the automatic transcription (--fetch-media loads in the audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))

Hope this helps!


Thanks for sharing! :+1: One quick comment on that in case others come across this thread later: it's not really necessary to add the data to a dataset – instead, you can also load it directly from a .jsonl file and set --loader jsonl.