✨ Audio annotation UI (beta)

I posted a teaser of this on Twitter the other day :smiley: The prototype worked well enough that we decided to ship it as a "secret" beta feature in the latest v1.9.4. So if you're working with audio and want to test it, it's super easy now.

Requirements

  • Latest Prodigy v1.9.4 (released today, December 28)
  • A directory of audio files (.mp3 or .wav)

How it works

New interfaces: There are two new interfaces: audio (displays an audio file with optional pre-defined, non-editable regions) and audio_manual (displays an audio file and lets the user draw, edit or remove regions for one or more labels).

New loader: Prodigy also ships with a new loader, audio-server, which serves the audio files via the local Prodigy web server so they can be loaded in the app. Each task it creates (and that's later saved with the annotations in the database) also includes the original file path and file name. Instead of using the audio-server loader, you can of course also load in a JSONL file where each task specifies a live URL for "audio" (just no local paths, since your browser will block those).
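
For example, each line of such a JSONL file could look like this (the URL is just a placeholder):

{"audio": "https://example.com/recording.mp3"}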

UI settings: You can toggle play/pause using the enter key. I've found that I often want to use the space bar instead (maybe because video editing tools do it this way, too?), so you might want to update the "keymap" in your prodigy.json or recipe config to remap the keys: "keymap": {"playpause": ["space"], "ignore": ["enter"]}

The audio UI supports the following settings: "show_audio_cursor": true (display a line at the position of the cursor), "show_audio_timeline": false (show a timeline) and "audio_minimap": true (show a minimap of the whole file).
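
Putting it all together, your prodigy.json could look something like this (the values here are just examples, not necessarily the defaults):

{
    "keymap": {"playpause": ["space"], "ignore": ["enter"]},
    "show_audio_cursor": true,
    "show_audio_timeline": true,
    "audio_minimap": false
}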

Data format: Audio tasks need to contain an "audio" key and labelled regions are represented as "audio_spans" with a "start" and "end" (timestamp in seconds), a "label", an optional "color" and an "id".
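
So a task with one pre-annotated region could look something like this (file name, label, timestamps, region ID and color are made up, and the JSON is pretty-printed here for readability – in JSONL it'd be one object per line):

{
    "audio": "https://example.com/interview.mp3",
    "audio_spans": [
        {"start": 2.5, "end": 7.8, "label": "SPEAKER_1", "id": "region-1", "color": "#ff6b6b"}
    ]
}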

Example 1: Manual audio annotation

Stream in audio files from a directory and annotate regions in them using the given labels. Use cases: diarization (speaker identification), selecting noise or disfluencies, etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

Example 2: Binary audio annotation

Stream in audio files from a directory and collect binary annotations for a given (optional) label. Use cases: binary audio classification, data curation, etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label GOOD_QUALITY --view-id audio

Example 3: Manual transcript

Load in audio files from a directory and ask the user to transcribe the audio manually. If you already have transcripts, you could also write a stream generator that pre-populates the "transcript" field for each audio task (so the annotators only need to correct it).

import prodigy
from prodigy.components.loaders import AudioServer

@prodigy.recipe(
    "audio-transcript",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str)
)
def audio_transcript(dataset: str, source: str):
    stream = AudioServer(source)
    blocks = [
        # Audio player plus a free-form text field for the transcript
        {"view_id": "audio"},
        {"view_id": "text_input", "field_rows": 2, "field_label": "Transcript", "field_id": "transcript"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
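
Assuming you've saved the recipe in a file recipe.py, you should be able to run it like this (dataset name and path are placeholders):

prodigy audio-transcript audio_dataset "/path/to/audios" -F recipe.py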

Example 4: Alignment of text and audio

Highlight spans in the text that correspond to the audio, and vice versa. How you set this up depends on your requirements: you can load in already existing annotated regions in the text ("spans") or audio ("audio_spans"), or do both from scratch. For instance, if you have text with existing spans for disfluencies, you could ask the annotator to select the corresponding regions in the audio.

import prodigy
from prodigy.components.loaders import AudioServer
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy

TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript..."
}

@prodigy.recipe(
    "audio-alignment",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str), 
    label=("One or more comma-separated labels", "option", "l", split_string),
    lang=("Language for text tokenization", "option", "ln", str),
)
def audio_alignment(dataset: str, source: str, label: list = [], lang: str = "en"):
    def get_stream():
        stream = AudioServer(source)
        for eg in stream:
            # Get transcript for the audio file
            if eg["path"] in TRANSCRIPTS:
                eg["text"] = TRANSCRIPTS[eg["path"]]
                yield eg
    
    nlp = spacy.blank(lang)
    stream = get_stream()
    stream = add_tokens(nlp, stream)  # add tokens for manual text highlighting

    blocks = [
        {"view_id": "audio_manual", "labels": label}, 
        {"view_id": "ner_manual", "labels": label}
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
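
As above, assuming the recipe lives in recipe.py, running it could look like this (the labels are just examples):

prodigy audio-alignment audio_dataset "/path/to/audios" --label SPEAKER,DISFLUENCY -F recipe.py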

:warning: Known problems / open questions

  • If you go back and forth (submit and undo) very quickly, audio loading can fail and existing regions won't be drawn correctly, because there's no audio loaded. Going back and forth again usually fixes it. I've been thinking about adding a "reload" button that completely reloads the current annotation card and audio, in case something goes wrong (but I haven't found a nice solution for this yet).
  • Resizing is a bit fiddly and you need to hit the exact region boundary. The cursor will then become a "resize" cursor (as opposed to the "move" cursor).
  • The loader currently only selects .wav and .mp3 files. Are there any other formats it should support?

Also: I need a cool test audio that we can use for the docs later on – ideally not too long, with different speakers, maybe disfluencies etc. And with a suitable license (public domain, CC etc.). Any ideas or suggestions? :slightly_smiling_face:


Hi @ines,

Would this audio annotator support live transcription? That is, live streaming of audio, and providing real-time transcription.

Thanks.

It could, but I'm not sure it'd be that useful? For annotation, especially if the goal is creating training data, you typically want the data collection process to be reliable and reproducible. There's way too much variance in live streams and it makes the whole process unpredictable. The focus of any audio workflows will definitely be creating training data for machine learning, not doing real-time transcription for other processes.