Combine audio.manual and audio.transcribe?

We have a task to be able to annotate/classify audio and at the same time update the transcription. Is it possible to combine audio.manual and audio.transcribe in the same session?


When you want to combine recipes, your best bet is to write a custom recipe that combines views using the blocks feature. I hadn't worked with audio in Prodigy before though, so I figured it'd be fun to explore this one for you.

The code

Here's the custom recipe for your task.

import prodigy
from typing import List, Optional, Union, Iterable

from prodigy.components.loaders import get_stream
from prodigy.components.preprocess import fetch_media as fetch_media_preprocessor
from prodigy.util import log, msg, get_labels, split_string
from prodigy.types import TaskType, RecipeSettingsType

def remove_base64(examples: List[TaskType]) -> List[TaskType]:
    """Remove base64-encoded string if "path" is preserved in example."""
    for eg in examples:
        if "audio" in eg and eg["audio"].startswith("data:") and "path" in eg:
            eg["audio"] = eg["path"]
        if "video" in eg and eg["video"].startswith("data:") and "path" in eg:
            eg["video"] = eg["path"]
    return examples

@prodigy.recipe(
    "audio.custom",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
    loader=("Loader to use", "option", "lo", str),
    label=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    keep_base64=("If 'audio' loader is used: don't remove base64-encoded data from the data on save", "flag", "B", bool),
    autoplay=("Autoplay audio when a new task loads", "flag", "A", bool),
    fetch_media=("Convert URLs and local paths to data URIs", "flag", "FM", bool),
    exclude=("Comma-separated list of dataset IDs whose annotations to exclude", "option", "e", split_string),
    # fmt: on
)
def custom(
    dataset: str,
    source: Union[str, Iterable[dict]],
    loader: Optional[str] = "audio",
    label: Optional[List[str]] = None,
    autoplay: bool = False,
    keep_base64: bool = False,
    fetch_media: bool = False,
    exclude: Optional[List[str]] = None,
    text_rows: int = 4,
    field_id: str = "transcript",
) -> RecipeSettingsType:
    log("RECIPE: Starting recipe audio.custom", locals())
    if label is None:
        msg.fail("audio.custom requires at least one --label", exits=1)
    stream = get_stream(source, loader=loader, dedup=True, rehash=True, is_binary=False)
    if fetch_media:
        stream = fetch_media_preprocessor(stream, ["audio", "video"])
    blocks = [
        {"view_id": "audio_manual"},
        {
            "view_id": "text_input",
            "field_rows": text_rows,
            "field_label": "Transcript",
            "field_id": field_id,
            "field_autofocus": True,
        },
    ]
    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "before_db": remove_base64 if not keep_base64 else None,
        "exclude": exclude,
        "config": {
            "blocks": blocks,
            "labels": label,
            "audio_autoplay": autoplay,
            "auto_count_stream": True,
        },
    }
Pay close attention to the blocks that I've defined. You'll notice that I'm referring to two interfaces:

  • The audio_manual interface, which is listed on top.
  • The text_input interface, which is shown below it. It also carries some extra settings that will appear in the annotation card and change its appearance.

Using the recipe

To use this recipe, I've prepared an audio file and moved it into a folder called audios. Next, I run Prodigy via:

python -m prodigy audio.custom issue-5993 audios --label low,high -F

The annotations will go into a dataset called issue-5993, the recipe will take all the audio files from the audios folder, and it will use the high and low labels. Finally, I also make sure that the custom Python script is attached via -F so that Prodigy can recognise the audio.custom name.

This gives me an interface that looks like this:


When I save this annotation in Prodigy, I can fetch it via:

python -m prodigy db-out issue-5993 | jq

In my case, it looked like this:

{
  "audio": "audios/voice-demo.m4a",
  "text": "voice-demo",
  "meta": {
    "file": "voice-demo.m4a"
  },
  "path": "audios/voice-demo.m4a",
  "_input_hash": 696511040,
  "_task_hash": 130121222,
  "_is_binary": false,
  "_view_id": "blocks",
  "audio_spans": [
    {
      "start": 0.8781535226,
      "end": 2.5764079916,
      "label": "low",
      "id": "2d57f5b4-2027-4e88-8a34-d228000e5ab1",
      "color": "rgba(255,215,0,0.2)"
    },
    {
      "start": 2.9645804416,
      "end": 3.9419432177,
      "label": "high",
      "id": "205b4348-6fa5-4f4d-bd6e-a47e782d25ae",
      "color": "rgba(0,255,255,0.2)"
    }
  ],
  "transcript": "I'm talking with a low voice. \n\n And a high one!",
  "answer": "accept",
  "_timestamp": 1664529435
}
I think this is what you want. If you're interested in learning more about how to make custom recipes in general, you may appreciate this tutorial video I've made on YouTube. It's not for an audio use-case, but it does shed some light on how custom recipes work in Prodigy.

Thank you very much. We are new to Prodigy, but I guess this is the way for us. But is there a way to load both audios and transcripts when starting a new session? Our use case is that we have the audios and original transcripts in separate folders, and we want to annotate the audio and also fix its matching original transcript.

I think what we need to do is combine audio.transcribe and the classify-audio recipe as shown here:

We will try first what we can based on the code above. Thanks.


Yeah, you can adapt the recipe as needed. Note that if you add a transcript key to the examples in the stream (this corresponds with the key in the JSON output), you can control what the text field is pre-filled with.

If I only change the bottom bit of the recipe to become:


    def add_options(stream):
        for item in stream:
            item['transcript'] = 'i am prefilled!'
            yield item

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": add_options(stream),
        "before_db": remove_base64 if not keep_base64 else None,
        "exclude": exclude,
        "config": {
            "blocks": blocks,
            "labels": label,
            "audio_autoplay": autoplay,
            "auto_count_stream": True,
        },
    }

Then the transcript is pre-filled. Here's what that would look like:

You should also be able to do something similar with the spans, and you can also expand this example such that the pre-filled transcript depends on the example that you've provided. Anything you can write with Python, you can use to customise here :smile: .
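For the use-case of audios and transcripts in separate folders, here's a sketch of what such a stream wrapper could look like. It matches files by stem, so it assumes a layout like audios/voice-demo.m4a next to transcripts/voice-demo.txt; the folder name and the helper name are just illustrations, not Prodigy API.

```python
from pathlib import Path


def add_transcripts(stream, transcript_dir="transcripts"):
    """Pre-fill each task's "transcript" field from a matching .txt file.

    For audios/voice-demo.m4a this looks for transcripts/voice-demo.txt.
    Tasks without a matching transcript pass through unchanged.
    """
    for eg in stream:
        txt_path = Path(transcript_dir) / (Path(eg["path"]).stem + ".txt")
        if txt_path.exists():
            eg["transcript"] = txt_path.read_text().strip()
        yield eg
```

In the recipe you'd then wrap the stream the same way as `add_options` above, i.e. `"stream": add_transcripts(stream)`.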