✨ Audio annotation UI (beta)

I posted a teaser of this on Twitter the other day :smiley: The prototype worked well enough that we decided to ship it as a "secret" beta feature in the latest v1.9.4. So if you're working with audio and want to test it, it's super easy now.


All you need is:

  • The latest Prodigy v1.9.4 (released today, December 28)
  • A directory of audio files (.mp3 or .wav)

How it works

New interfaces: There are two new interfaces: audio (display an audio file with optional pre-defined, non-editable regions) and audio_manual (display an audio file and allow the user to draw, edit or remove regions for one or more labels).

New loader: Prodigy also ships a new loader, audio-server, which serves the audio files via the local Prodigy web server so they can be loaded in the app. Each task it creates (and that's later saved with the annotations in the database) also includes the original file path and file name. Instead of using the audio server loader, you can of course also load in a JSONL file where each task specifies a live URL for "audio" (just no local paths, since your browser will block that).

UI settings: You can toggle play/pause using the enter key. I've found that I often want to use the space bar (maybe because video editing tools do it this way, too?) so you might want to update the "keymap" in your prodigy.json or recipe config to remap the keys: "keymap": {"playpause": ["space"], "ignore": ["enter"]}

The audio UI supports the following settings: "show_audio_cursor": true (display a line at the position of the cursor), "show_audio_timeline": false (show a timeline) and "audio_minimap": true (show a minimap of the whole file).
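Putting the keymap remapping and these settings together, the relevant part of a prodigy.json could look like this (just one plausible combination, using the values mentioned above):

```json
{
  "keymap": {"playpause": ["space"], "ignore": ["enter"]},
  "show_audio_cursor": true,
  "show_audio_timeline": false,
  "audio_minimap": true
}
```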

Data format: Audio tasks need to contain an "audio" key, and labelled regions are represented as "audio_spans", each with a "start" and "end" (timestamps in seconds), a "label", an optional "color" and an "id".
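For illustration, here's a hypothetical task in this format – the file URL, labels, timestamps and ids below are all made up:

```python
import json

# "audio" can be a live URL (or a path served by the audio-server loader),
# and each labelled region lives under "audio_spans"
task = {
    "audio": "https://example.com/interview.mp3",
    "audio_spans": [
        {"start": 2.5, "end": 7.25, "label": "SPEAKER_1", "id": "region-1"},
        {"start": 7.25, "end": 11.0, "label": "SPEAKER_2", "color": "#ff6666", "id": "region-2"},
    ],
}
line = json.dumps(task)  # one task per line in a .jsonl file
```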

Example 1: Manual audio annotation

Stream in audio files from a directory and annotate regions in them using the given labels. Use cases: diarization (speaker identification), selecting noise, disfluencies etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

Example 2: Binary audio annotation

Stream in audio files from a directory and collect binary annotations for a given (optional) label. Use cases: binary audio classification, data curation, etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label GOOD_QUALITY --view-id audio

Example 3: Manual transcript

Load in audio files from a directory and ask the user to transcribe the audio manually. If you already have transcripts, you could also write a stream generator that pre-populates the "transcript" field for each audio task (so the annotators only need to correct it).

import prodigy
from prodigy.components.loaders import AudioServer


@prodigy.recipe(
    # recipe name chosen for illustration (the original post omitted the decorator line)
    "audio-transcript",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str),
)
def audio_transcript(dataset: str, source: str):
    stream = AudioServer(source)
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_rows": 2, "field_label": "Transcript", "field_id": "transcript"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

Example 4: Alignment of text and audio

Highlight spans in the text that correspond to the audio, and vice versa. How you set this up depends on your requirements: you can load in already existing annotated regions in the text ("spans") or audio ("audio_spans"), or do both from scratch. For instance, if you have text with existing spans for disfluencies, you could ask the annotator to select the corresponding regions in the audio.

import prodigy
from prodigy.components.loaders import AudioServer
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy

# Map each audio file path to its existing transcript
TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript...",
}


@prodigy.recipe(
    # recipe name chosen for illustration (the original post omitted the decorator line)
    "audio-alignment",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    lang=("Language for text tokenization", "option", "ln", str),
)
def audio_alignment(dataset: str, source: str, label: list = [], lang: str = "en"):
    def get_stream():
        stream = AudioServer(source)
        for eg in stream:
            # Get transcript for the audio file
            if eg["path"] in TRANSCRIPTS:
                eg["text"] = TRANSCRIPTS[eg["path"]]
                yield eg

    nlp = spacy.blank(lang)
    stream = get_stream()
    stream = add_tokens(nlp, stream)  # add tokens for manual text highlighting

    blocks = [
        {"view_id": "audio_manual", "labels": label},
        {"view_id": "ner_manual", "labels": label},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

:warning: Known problems / open questions

  • If you go back and forth (submit and undo) very fast, it may cause the audio loading to fail and existing regions won't be drawn correctly because there's no audio loaded. It's usually solved by going back and forth again. I've been thinking about adding a "reload" button that completely reloads the current annotation card and audio, in case something goes wrong (but I haven't found a nice solution for this yet).
  • Resizing is a bit fiddly and you need to hit the exact region boundary. The cursor will then become a "resize" cursor (as opposed to the "move" cursor).
  • The loader currently only selects .wav and .mp3 files. Are there any other formats it should support?

Also: I need a cool test audio that we can use for the docs later on – ideally not too long, with different speakers, maybe disfluencies etc. And with a suitable license (public domain, CC etc.). Any ideas or suggestions? :slightly_smiling_face:


Hi @ines,

Would this audio annotator support live transcription? That is, live streaming of audio, and providing real-time transcription.


It could, but I'm not sure it'd be that useful? For annotation, especially if the goal is creating training data, you typically want the data collection process to be reliable and reproducible. There's way too much variance in live streams and it makes the whole process unpredictable. The focus of any audio workflows will definitely be creating training data for machine learning, not doing real-time transcription for other processes.

Thanks Ines! I think, as it currently stands, this works for transcribing short utterances. For longer audio files (e.g. conversations), many would like to transcribe and submit each segment/utterance/sentence one after another, which streamlines the whole task instead of having to transcribe and submit the whole audio file at once. Another rather important feature would be automatically adding the time stamps for the transcribed utterances/phrases (this is what we actually need).

I think Prodigy would meet the requirements of many who prefer to have something that is more user friendly and not as complex as Praat (http://www.fon.hum.uva.nl/praat/).

Yes, I agree – I think smaller chunks are more efficient to work with. Cutting up the files is probably a better fit for the Python / recipe level? For instance, you could use a Python audio library to automatically cut the files (e.g. on X milliseconds or more of silence), log the start / end positions, save the resulting snippets out to individual files and then stream in that directory.
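As a dependency-free sketch of that chunking logic (a real recipe would read the samples with an audio library like pydub or soundfile and write the snippets back out; the function name, threshold and durations here are illustrative):

```python
def split_on_silence(samples, rate, threshold=0.01, min_silence_ms=300):
    """Split a mono sample sequence into (start_sec, end_sec) chunks,
    cutting wherever the signal stays below `threshold` for at least
    `min_silence_ms` milliseconds."""
    min_silence = int(rate * min_silence_ms / 1000)
    chunks, start, silent_run = [], None, 0
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            silent_run += 1
        else:
            if start is None:
                start = i  # first loud sample of a new chunk
            silent_run = 0
        if start is not None and silent_run >= min_silence:
            # close the chunk at the last loud sample
            chunks.append((start / rate, (i - silent_run + 1) / rate))
            start, silent_run = None, 0
    if start is not None:
        chunks.append((start / rate, len(samples) / rate))
    return chunks
```

Each (start, end) pair gives you the log of positions mentioned above, and the corresponding slice of the file can be saved as its own snippet.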

Thanks for this new feature, this is great!
I'm currently experimenting with tagging portions of audio that contain last names. Sometimes these portions are very short and it's difficult to be accurate. It would be nice if we could zoom in on the waveform.

@Marie Thanks, that's nice to hear! I'll look into options for zooming, that's a good point.

Other features I'll be working on:

  • try and improve loading times for files by using a native audio player under the hood
  • add support for base64-encoded audio data (faster loading, allows storing data with the annotations). I had initially ruled this out because I thought audio data was too big, but turns out it's actually totally viable for compressed audio
  • assign different colours to regions based on the label (just like the bounding boxes in image_manual etc.) – at the moment, it only supports pre-defined "color" values on the audio spans
  • allow changing label of existing region (without having to re-add it)

We're also currently exploring options for different audio workflows for semi-automatic annotation and using active learning with a model in the loop, e.g. for speaker diarization. I'll hopefully be able to share more on this soon :slightly_smiling_face:

Thanks for the update and this cool feature Ines. I'm just dropping a note to let you know my company and I will be trying this feature out over the coming month. I'll have to post back with details on how that goes. Our specific possible implementation has to do with Ag-Tech and saving piglet lives.


Thanks @ines, this is a great feature to include in Prodigy, thank you.

In version 1.9.9, though, there seems to be a bug. If I run your manual audio annotation example as indicated:

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

The first audio will load properly.

As I click the green (accept) button to go to the next example, the audio wave disappears, so I lose the ability to do the manual audio annotation.

Only if I reload the browser will this work properly again, but only for the first audio file loaded.

Any pointers?

Changing the batch_size in prodigy.json to 1 resolves the issue I mention above.

@cesarandreslopez Thanks for testing the interface and the feedback. It sounds like this is related to the loading process crashing somewhere in between :thinking: The next version of Prodigy will also provide an Audio loader that converts the audio data to base64, which should solve the loading issues, and it'll also have an option to drop that data from the examples before they're saved in the database (to prevent bloat).

(If your files aren't too big, you can already try and stream in base64-encoded strings as the "audio" value. It should already work out-of-the-box – you'll just end up with the data in your database, which may take up quite a bit of space.)
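A small helper along these lines could build that base64 "audio" value from a local file (the data-URI approach and the MIME-type guessing are assumptions for illustration, not the built-in loader):

```python
import base64

def audio_to_data_uri(path):
    """Embed a local audio file as a base64 data URI that the browser
    can play directly, with no file server in between."""
    mime = "audio/mpeg" if path.endswith(".mp3") else "audio/wav"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{data}"
```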

Hi @ines, this is a great feature for Prodigy that matches my project. I would like to know if it supports stereo audio and, if so, how it works in that case: can I label audio regions in each channel and also align text to the corresponding audio region in each channel?

That's an interesting use case. So how would this look in practice? Would you want to see both channels separately and also be able to annotate them separately, or at least, align the regions differently?

Displaying split channels is definitely no problem, but I'm not sure how easy it'd be to allow per-channel region annotations.

Thanks for your answer.

Yes, in the case of a stereo call, if we could see both channels separately and annotate them separately, it would allow us to associate a label with both the audio and the channel.

For example, this could be useful to label sentiment for each speaker, or to mark incorrect words and provide their correct transcription, indicating the channel that contains that information.

Ah, so you'd have different content on the different channels? At the moment, you'd have to split the channels and stream them in as separate tasks in order. You can still add custom meta information to the tasks that tell you which task is related to which file and channel.
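A sketch of that splitting step using only the standard library (the helper name and the task/meta layout are hypothetical, not built-in; it assumes 16-bit stereo WAV input):

```python
import wave

def split_channels(path):
    """Split a 16-bit stereo WAV into one mono file per channel and
    return one task dict per channel, with custom meta linking it back
    to the original file and channel."""
    with wave.open(path, "rb") as w:
        if w.getnchannels() != 2 or w.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo WAV")
        framerate = w.getframerate()
        frames = w.readframes(w.getnframes())
    tasks = []
    for ch in (0, 1):
        # De-interleave: frames are stored L R L R ... as 2-byte samples
        mono = b"".join(frames[i:i + 2] for i in range(ch * 2, len(frames), 4))
        out_path = f"{path.rsplit('.', 1)[0]}_ch{ch}.wav"
        with wave.open(out_path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(2)
            out.setframerate(framerate)
            out.writeframes(mono)
        tasks.append({"audio": out_path, "meta": {"file": path, "channel": ch}})
    return tasks
```

Streaming the resulting tasks in order keeps the two channels of the same call next to each other, and the meta tells you which task belongs to which file and channel.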
