I generated an audio transcription using ASR (and wrote it into a jsonl file) which I would like to correct manually using Prodigy. I understand how to load in a transcription to correct it using dataset: but I cannot load in the audio file that I want to transcribe.
I tried:
prodigy audio.transcribe new_transcription ./recordings dataset:old_transcription
and I got:
prodigy audio.transcribe: error: unrecognized arguments: dataset:old_transcription
When I run it without the ./recordings part, it works perfectly fine (and the dataset old_transcription certainly does exist), but then I obviously don't have the audio loaded in, which would be useful for transcribing.
Hi! I think what you're looking for is the --fetch-media flag: this will load back the media (e.g. audio files) from the paths in the JSON when you load your previous annotations back in. Just make sure the data is available via the same path.
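For example, with the dataset name from your command (the output dataset name is up to you):
prodigy audio.transcribe new_transcription dataset:old_transcription --fetch-media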
Ok, I actually have one follow-up question: it should work with prodigy.serve() integrated in a python script as well, shouldn't it?
So what I'm doing is
prodigy.serve(" prodigy audio.transcribe new dataset:old --fetch-media")
but it loads neither the old transcription nor the audio file, even though the same command works in the terminal. Am I missing something?
Cheers
prodigy.serve should resolve to exactly the same process as running the command from the command line. Are you running the script from the same working directory? If your paths are relative paths to a local directory, the execution context of the script could mean that the paths aren't found.
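One quick way to rule that out is to make the script independent of where it's started from, e.g. something like this (just a sketch, using the dataset names from your call):
import os
import prodigy

# change into the script's own directory so relative audio paths resolve
# the same way they do when you run the command from that directory
os.chdir(os.path.dirname(os.path.abspath(__file__)))
prodigy.serve("audio.transcribe new dataset:old --fetch-media")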
Right, so I got that working now, but what I'm actually trying to do is load a transcription jsonl that was not created with Prodigy, and I think that might be why the audio file isn't loading.
I looked at the structure of the jsonl files created by Prodigy and built one along the same lines:
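(A rough sketch of the structure; "recording.wav" stands in for my actual audio file, and the hash numbers are just taken over from a real Prodigy output file. The full script is further down.)
{"audio": "recording.wav", "text": "", "meta": {"file": "recording.wav"}, "path": "recording.wav",
 "_input_hash": -758377432, "_task_hash": -1593748291, "_session_id": null, "_view_id": "blocks",
 "audio_spans": [], "transcript": "the ASR transcription as one string", "answer": "accept"}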
The audio file is in the same directory as my python script, and I did try using its absolute path, which had no effect at all.
I'm pretty sure I got the structure and all the key names right (haven't I?), and I'm guessing the problem might come from _input_hash and _task_hash. I just used hashes from a real Prodigy output file and changed some digits, but I also use set_hashes, so I don't know if that causes some kind of error?
Hi! When you say the audio isn't loading, do you mean that the examples show up but the audio widget / waveform is empty? Or do the examples not get sent out and you see "No tasks available"?
If the waveform is empty, try setting the PRODIGY_LOGGING=basic environment variable and see if there's any output in the logs that indicates that media content couldn't be loaded.
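For example (using the dataset names from your serve call), something like:
PRODIGY_LOGGING=basic prodigy audio.transcribe new dataset:old --fetch-media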
Is there anything in the dataset annotations? The hashes are mostly relevant for detecting whether two examples are identical, and as a result, whether an example should be presented to you for annotation, or whether it's already annotated in the dataset and should be skipped. So if you already have examples with the same hashes in the dataset, new examples coming in with the same hashes would be skipped.
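If you want to double-check, here's a rough sketch (dataset name and filename are placeholders): it compares the hashes your generated example would get against what's already stored in the dataset you're annotating into.
from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()
existing = db.get_dataset("new_transcription")  # the dataset you're saving annotations into
existing_hashes = {eg["_task_hash"] for eg in existing}

new_eg = set_hashes({"audio": "recording.wav", "transcript": "some text"})
print(new_eg["_task_hash"] in existing_hashes)  # True would mean the example gets skipped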
Thanks! And just to confirm, the same command works fine if you run it on the CLI and the audio data is displayed correctly?
Also, could you share the logging output that's shown once you open the app in the browser and load the first examples? It should start with something like POST: /get_session_questions etc.
Yes and no.
When I do prodigy audio.transcribe first path/to/audio, save some transcription and then run prodigy audio.transcribe second dataset:first --fetch-media, it works (transcript and audio are loaded).
But when I try to do the same thing with a manually created jsonl that I add to the database as shown here, the audio widget stays empty, whether I'm calling it from a python script or from the shell. So there must be a problem with the way I'm incorporating (the path to) the audio file in the manually created jsonl file (the one I posted above).
What I'm curious about is the path key: in the Prodigy-created jsonl files I looked at, its value is only the name of the audio file, just like the value of the audio key, which doesn't seem logical to me. I tried replacing it with the absolute path, which unfortunately didn't change anything. But I think we might be getting close to solving the problem now ^^. Here's the logging output once the first example is loaded:
14:08:45: POST: /get_session_questions
14:08:45: FEED: Finding next batch of questions in stream
14:08:45: RESPONSE: /get_session_questions (1 examples)
INFO: 127.0.0.1:53862 - "POST /get_session_questions HTTP/1.1" 200 OK
The "path" key is only really used to keep a copy of the original path when the value of "audio" etc. is replaced by the loaded base64-encoded data. Before the examples are placed in the database, Prodigy will then strip out the base64 data again and replace it with the path, so the data stored in the DB stays compact. (This is the default behaviour for audio recipes unless you set --keep-base64.)
If fetching the data works with the existing dataset but not with your generated examples, there must be a subtle difference here. What does the resulting stream look like when you call the fetch_media preprocessor from Python, e.g. like this? Does it have base64 strings in it?
from prodigy.components.preprocess import fetch_media

# sample_examples = the list of task dicts you generated for your jsonl
stream = fetch_media(sample_examples)
print(list(stream))
But anyway, using a list, it returns a long base64-encoded string like this: tupbGJU5bzS8vgEESABCDfIapUYrww4yMbV0vtzg/NA6zhxFcsMbG2Icuk6xL8CMzuafZF3YxCZQn5sK12/3rOc/VqYfR7P3JWXclbSd7bbU9b2+jLXXrFf69txtQp/qkKPrF86tiJ6fWZ7a+ocCmd7zr41W9Pq/xT5x/67zreNa3akWbFs+/v4FoGK
(And when decoded, it's nonsense characters.)
Hmm, this all looks correct to me: the fact that it encodes the data means that the file paths are found and are correct. (Don't worry about the decoded data not making sense, because that's literally the binary audio data.)
I'll experiment with this some more, but it's definitely pretty mysterious that it works if you run it one way and not the other. If you have a reproducible example that I can run, that would definitely be helpful.
I have my transcription saved as running text in asr_transcription.txt.
I create a dictionary (jsonl) where I put the transcription as a string alongside the required keys 'audio', 'text' etc., as shown below (the hash numbers are practically random). I then add this dictionary to the database as shown here: https://prodi.gy/docs/api-database.
At the end I call prodigy audio.transcribe with a new dataset name, dataset: pointing to the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the jsonl dictionary.
with open("asr_transcription.txt", encoding="utf-8") as asr_read:
text = ""
for line in asr_read.readlines():
if line[0].isalpha():
text += line.strip()+"\n"
jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename, '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [],"transcript": text, "answer": "accept"}
# save asr transcription to prodigy database
db = connect()
# hash the examples to make sure they all have a unique task hash and input hash โ this is used by Prodigy to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]
# create new dataset
db.add_dataset(dataset)
# add examples to the (existing!) dataset
db.add_examples(jsonl_annotations, datasets=[dataset])
#open prodigy to review automatic transcriptionion (--fetch-media loads in audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))
Thanks for sharing! One quick comment on that in case others come across this thread later: it's not really necessary to add the data to a dataset first. Instead, you can also load it directly from a .jsonl file and set --loader jsonl.
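For example, something along these lines (the file and dataset names here are just placeholders):
prodigy audio.transcribe reviewed_dataset ./asr_transcription.jsonl --loader jsonl --fetch-media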