Right, so I got that working now, but what I'm actually trying to do is load a transcription JSONL that was not made using Prodigy, and I think that might be why the audio file isn't loading.
I looked at the structure of prodigy created jsonl files and created one like this:
The audio file is in the same directory as my Python script, and I did try using its absolute path, which had no effect at all.
I'm pretty sure I got the structure and all the keys right (haven't I?), and I'm guessing the problem might come from the _input_hash and _task_hash. I just took hashes from a real Prodigy output file and changed some digits, but I also use set_hashes, so I don't know if that causes some kind of error?
Hi! When you say the audio isn't loading, do you mean that the examples show up but the audio widget / waveform is empty? Or do the examples not get sent out and you see "No tasks available"?
If the waveform is empty, try setting the PRODIGY_LOGGING=basic environment variable and see if there's any output in the logs that indicates that media content couldn't be loaded.
Is there anything in the dataset annotations? The hashes are mostly relevant for detecting whether two examples are identical, and as a result, whether an example should be presented to you for annotation, or whether it's already annotated in the dataset and should be skipped. So if you already have examples with the same hashes in the dataset, new examples coming in with the same hashes would be skipped.
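To illustrate the skipping behaviour described above, here's a minimal pure-Python sketch of filtering a stream against hashes already in a dataset. This is not Prodigy's actual implementation (which uses murmurhash over configurable keys); the function names and hashing scheme here are just for illustration:

```python
import hashlib
import json

def input_hash(example, input_keys=("text", "audio")):
    """Hash only the input fields, mimicking the idea of an _input_hash
    (illustrative only – the real implementation differs)."""
    values = {k: example.get(k) for k in input_keys}
    payload = json.dumps(values, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()

def filter_seen(stream, seen_hashes):
    """Skip examples whose input hash is already in the dataset."""
    for eg in stream:
        h = input_hash(eg)
        if h in seen_hashes:
            continue  # considered already annotated -> skipped
        seen_hashes.add(h)
        yield eg

dataset = [{"audio": "a.wav", "text": ""}]  # already annotated
seen = {input_hash(eg) for eg in dataset}
incoming = [
    {"audio": "a.wav", "text": ""},  # same hash as dataset -> filtered out
    {"audio": "b.wav", "text": ""},  # new hash -> kept
]
kept = list(filter_seen(incoming, seen))
```

So if your manually assigned hashes collide with hashes already in the dataset, the new examples silently never reach the UI.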
Thanks! And just to confirm, the same command works fine if you run it on the CLI and the audio data is displayed correctly?
Also, could you share the logging output that's shown once you open the app in the browser and load the first examples? It should start with something like POST: /get_session_questions etc.
Yes and no.
When I run
prodigy audio.transcribe first path/to/audio
save some transcription, and then run
prodigy audio.transcribe second dataset:first --fetch-media
it works (transcript and audio are loaded).
But when I want to do the same thing with a manually created jsonl that I add to the database as shown here
the audio widget stays empty whether I'm calling it from a python script or from the shell.
So there must be a problem with the way I'm incorporating (the path to) the audio file in the manually created jsonl file which looks like this:
What I'm curious about is the "path" key: in the Prodigy-created JSONL files I looked at, its value is just the name of the audio file, the same as the value of the "audio" key, which doesn't seem logical to me. I tried replacing it with the absolute path, which unfortunately didn't change anything. But I think we might be getting close to solving the problem now ^^
14:08:45: POST: /get_session_questions
14:08:45: FEED: Finding next batch of questions in stream
14:08:45: RESPONSE: /get_session_questions (1 examples)
INFO: 127.0.0.1:53862 - "POST /get_session_questions HTTP/1.1" 200 OK
The "path" key is only really used to keep a copy of the original path when the value of "audio" etc. is replaced by the loaded base64-encoded data. Before the examples are placed in the database, Prodigy will then strip out the base64 data again and replace it with the path, so the data stored in the DB stays compact. (This is the default behaviour for audio recipes unless you set --keep-base64.)
If fetching the data works with the existing dataset but not your generated examples, there must be a subtle difference here. What does the resulting stream look like when you call the fetch_media preprocessor from Python, e.g. like this? Does it have base64 strings in it?
from prodigy.components.preprocess import fetch_media
stream = fetch_media(sample_examples)
print(list(stream))
But anyway, using a list, it returns a long base64-encoded string like this: tupbGJU5bzS8vgEESABCDfIapUYrww4yMbV0vtzg/NA6zhxFcsMbG2Icuk6xL8CMzuafZF3YxCZQn5sK12/3rOc/VqYfR7P3JWXclbSd7bbU9b2+jLXXrFf69txtQp/qkKPrF86tiJ6fWZ7a+ocCmd7zr41W9Pq/xT5x/67zreNa3akWbFs+/v4FoGK
(And when decoded, it's nonsense characters.)
Hmm, this all looks correct to me – the fact that it encodes the data means that the file paths are found and are correct. (Don't worry about the decoded data not making sense because that's literally the binary audio data.)
I'll experiment with this some more, but it's definitely pretty mysterious that it works if you run it one way and not the other. If you have a reproducible example that I can run, that would definitely be helpful.
I have my transcription as fluid text saved in asr_transcription.txt.
I create a dictionary (jsonl) where I put the transcription as a string alongside the required keys 'audio', 'text' etc., as shown below (the hash numbers are practically random). I then add this dictionary to the database as shown here: https://prodi.gy/docs/api-database.
At the end I call prodigy audio.transcribe with a new dataset name, dataset: followed by the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the JSONL dictionary.
import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

with open("asr_transcription.txt", encoding="utf-8") as asr_read:
    text = ""
    for line in asr_read:
        if line[0].isalpha():
            text += line.strip() + "\n"

jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename,
         '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None,
         '_view_id': 'blocks', 'audio_spans': [], "transcript": text, "answer": "accept"}

# save asr transcription to prodigy database
db = connect()
# hash the example to make sure it has a unique task hash and input hash –
# this is used by Prodigy to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]
# create new dataset
db.add_dataset(dataset)
# add examples to the (existing!) dataset
db.add_examples(jsonl_annotations, datasets=[dataset])
# open prodigy to review the automatic transcription (--fetch-media loads in the audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))
Thanks for sharing! One quick comment on that in case others come across this thread later: it's not really necessary to add the data to a dataset – instead, you can also load it directly from a .jsonl file and set --loader jsonl.
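For example, instead of going through the database at all, you could write the examples to a .jsonl file (one JSON object per line) and point the recipe at it. The file name and example values here are hypothetical:

```python
import json

examples = [
    {"audio": "audio1.wav", "text": "", "transcript": "first transcript"},
    {"audio": "audio2.wav", "text": "", "transcript": "second transcript"},
]

# JSONL = one JSON object per line
with open("examples.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

# then on the CLI (sketch):
# prodigy audio.transcribe my_dataset examples.jsonl --loader jsonl --fetch-media
```

This also sidesteps the need to assign hashes yourself, since the recipe can hash the incoming examples.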
Still shows "No tasks available" after the first example.
17:55:11: FEED: Finding next batch of questions in stream
⚠ Warning: filtered 75% of entries because they were duplicates. Only 1 items
were shown out of 4. You may want to deduplicate your dataset ahead of time to
get a better understanding of your dataset size.
17:55:11: RESPONSE: /get_session_questions (1 examples)
INFO: ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
17:55:13: POST: /get_session_questions
17:55:13: FEED: Finding next batch of questions in stream
17:55:13: FEED: skipped: 21806991
17:55:13: RESPONSE: /get_session_questions (0 examples)
INFO: ::1:55235 - "POST /get_session_questions HTTP/1.1" 200 OK
It looks like the main problem here is that all examples except for one are filtered out because they're considered duplicates, most likely because they ended up with the same hashes.
One potential problem is here:
The keys should be iterables of strings, but ("audio") ends up being just the string "audio". So you'd want this to be either ("audio",) or ["audio"].
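The difference is easy to check in a plain Python session – parentheses alone don't make a tuple, the trailing comma does:

```python
not_a_tuple = ("audio")   # just a parenthesized string
real_tuple = ("audio",)   # the comma makes it a one-element tuple

assert not_a_tuple == "audio"
assert isinstance(real_tuple, tuple)

# why it matters: iterating over a string yields characters, not key names
assert list(not_a_tuple) == ["a", "u", "d", "i", "o"]
assert list(real_tuple) == ["audio"]
```

So a function expecting an iterable of key names would silently look up the keys "a", "u", "d", "i", "o" instead of "audio".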
Thanks for the prompt answer, but even with ("audio",) or ["audio"] I'm having the same issue. I don't think set_hashes() mapping everything to the same hash is the problem, since after calling the function I log both task and input hashes and they all seem to be different:
DEBUG:root:Inputing example TASK hashes:
DEBUG:root:[-1464920486, 1203448608, -244439580, 1932216617]
DEBUG:root:Inputing example INPUT hashes:
DEBUG:root:[158202857, 1296278939, 318945175, 549324070]
What might be another reason for this? Should I also delete the "Session" datasets before rerunning the script? (I run prodigy drop on input_dataset and corrected_dataset after each run of the script.)
One thing I noticed is that the script is rehashing the stream:
Maybe this is causing the hashes to be the same, because I have no control over the parameters of this hashing?
EDIT: For each example in the stream, the "text" field was ''. When I passed the audio path as the "text" field instead, there were no duplicates and all 4 examples showed up in the UI. This supports the hypothesis that Prodigy is re-hashing the examples by their "text" fields. Is there any way to circumvent this?
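A quick sketch of why hashing on an empty "text" field collapses everything. The hashing here is illustrative (md5 instead of the actual hashing Prodigy uses), but the collision behaviour is the same:

```python
import hashlib

def text_hash(example):
    """Hash an example by its "text" field only (illustrative)."""
    return hashlib.md5(example["text"].encode("utf-8")).hexdigest()

examples = [
    {"audio": "a.wav", "text": ""},
    {"audio": "b.wav", "text": ""},
    {"audio": "c.wav", "text": ""},
    {"audio": "d.wav", "text": ""},
]

# all four examples hash identically -> 3 of 4 get filtered as duplicates
empty_text_hashes = {text_hash(eg) for eg in examples}
assert len(empty_text_hashes) == 1

# giving each example a distinct "text" (e.g. the audio path) fixes it
for eg in examples:
    eg["text"] = eg["audio"]
distinct_hashes = {text_hash(eg) for eg in examples}
assert len(distinct_hashes) == 4
```

This matches what I saw in the logs: one example shown, the rest skipped as duplicates.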
Ah, thanks for the analysis! It looks like the recipe is indeed rehashing the stream. I can't think of a good reason why this is done in this particular recipe, so we should remove that. The easiest workaround is to just remove the rehash=True in recipes/audio.py (you can run prodigy stats to find the location of your Prodigy installation).