I have my transcription saved as running text in asr_transcription.txt.
I create a dictionary (one JSONL record) that holds the transcription as a string alongside the required keys ('audio', 'text', etc.), as shown below (the hash numbers are practically random). I then add this dictionary to the database as described here: https://prodi.gy/docs/api-database.
At the end I call prodigy audio.transcribe with a new dataset name, dataset:dataset pointing to the dataset I added to the database earlier, and --fetch-media to fetch the audio file specified in the JSONL dictionary.
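import prodigy
from prodigy import set_hashes
from prodigy.components.db import connect

# filename, dataset and reviewed_dataset are assumed to be defined earlier
with open("asr_transcription.txt", encoding="utf-8") as asr_read:
    text = ""
    for line in asr_read:
        # keep only the lines that start with a letter
        if line[0].isalpha():
            text += line.strip() + "\n"
jsonl = {'audio': filename, 'text': '', 'meta': {'file': filename}, 'path': filename, '_input_hash': -758377432, '_task_hash': -1593748291, '_session_id': None, '_view_id': 'blocks', 'audio_spans': [], "transcript": text, "answer": "accept"}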
# save asr transcription to prodigy database
db = connect()
# hash the examples to make sure they all have a unique task hash and input hash – this is used by Prodigy to distinguish between annotations on the same input data
jsonl_annotations = [set_hashes(jsonl)]
# create new dataset
db.add_dataset(dataset)
# add examples to the (existing!) dataset
db.add_examples(jsonl_annotations, datasets=[dataset])
# open Prodigy to review the automatic transcription (--fetch-media loads the audio as well)
prodigy.serve("audio.transcribe {} dataset:{} --fetch-media".format(reviewed_dataset, dataset))
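If you want to check the record before it goes into the database, the file-reading and dictionary-building steps can be pulled out into small helpers that run without Prodigy. This is only a sketch: read_transcript, build_task and write_jsonl are hypothetical names of mine, not part of Prodigy. It also differs from my snippet above in two ways: it strips each line before testing the first character, so blank or indented lines are handled, and it leaves the two hash keys out. As far as I know, set_hashes keeps hashes that are already present unless you pass overwrite=True, so leaving them out lets it compute real ones.

```python
import json


def read_transcript(path):
    """Return the lines that start with a letter, joined by newlines."""
    kept = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            # Testing the stripped line skips blank lines and leading whitespace.
            if stripped and stripped[0].isalpha():
                kept.append(stripped)
    return "\n".join(kept)


def build_task(audio_path, transcript):
    """Build one task dict with the same keys as above, minus the hashes."""
    return {
        "audio": audio_path,
        "text": "",
        "meta": {"file": audio_path},
        "path": audio_path,
        "audio_spans": [],
        "transcript": transcript,
        "answer": "accept",
    }


def write_jsonl(tasks, out_path):
    """Write one JSON object per line so the records can be inspected."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for task in tasks:
            handle.write(json.dumps(task, ensure_ascii=False) + "\n")
```

The dictionary returned by build_task can then go through set_hashes and db.add_examples exactly as in the snippet above.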
Hope this helps!