✨ Audio annotation UI (beta)

I posted a teaser of this on Twitter the other day :smiley: The prototype worked well enough that we decided to ship it as a "secret" beta feature in the latest v1.9.4. So if you're working with audio and want to test it, it's super easy now.

Requirements

  • Latest Prodigy v1.9.4 (released today, December 28)
  • A directory of audio files (.mp3 or .wav)

How it works

New interfaces: There are two new interfaces: audio (display an audio file with optional pre-defined, non-editable regions) and audio_manual (display an audio file and allow user to draw, edit or remove regions for one or more labels).

New loader: Prodigy also ships a new loader, audio-server, which serves the audio files via the local Prodigy web server so they can be loaded in the app. Each task it creates (and that's later saved with the annotations in the database) also includes the original file path and file name. Instead of using the audio server loader, you can of course also load in a JSONL file where each task specifies a live URL for "audio" (just no local paths, since your browser will block that).
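For example, each line of a JSONL file like that could look something like this (the URL is just a placeholder):

{"audio": "https://example.com/recording.mp3"}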

UI settings: You can toggle play/pause using the enter key. I've found that I often want to use the space bar (maybe because video editing tools do it this way, too?) so you might want to update the "keymap" in your prodigy.json or recipe config to remap the keys: "keymap": {"playpause": ["space"], "ignore": ["enter"]}

The audio UI supports the following settings: "show_audio_cursor": true (display a line at the position of the cursor), "show_audio_timeline": false (show a timeline) and "audio_minimap": true (show a minimap of the whole file).
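For example, assuming those settings live in the top-level config just like the keymap, your prodigy.json could look like this:

{
    "keymap": {"playpause": ["space"], "ignore": ["enter"]},
    "show_audio_cursor": true,
    "show_audio_timeline": true,
    "audio_minimap": true
}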

Data format: Audio tasks need to contain an "audio" key and labelled regions are represented as "audio_spans" with a "start" and "end" (timestamp in seconds), a "label", an optional "color" and an "id".
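So a minimal annotated task might look like this (the path, timestamps and label are made up for illustration):

{
    "audio": "/path/to/file1.mp3",
    "audio_spans": [
        {"start": 2.5, "end": 4.75, "label": "SPEAKER_1", "color": "magenta", "id": "region-1"}
    ]
}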

Example 1: Manual audio annotation

Stream in audio files from a directory and annotate regions in them using the given labels. Use cases: diarization (speaker identification), selecting noise, disfluencies etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

Example 2: Binary audio annotation

Stream in audio files from a directory and collect binary annotations for a given (optional) label. Use cases: binary audio classification, data curation, etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label GOOD_QUALITY --view-id audio

Example 3: Manual transcript

Load in audio files from a directory and ask the user to transcribe the audio manually. If you already have transcripts, you could also write a stream generator that pre-populates the "transcript" field for each audio task (so the annotators only need to correct it).

import prodigy
from prodigy.components.loaders import AudioServer

@prodigy.recipe(
    "audio-transcript",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str),
)
def audio_transcript(dataset: str, source: str):
    stream = AudioServer(source)  # serve audio files via the local web server
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_rows": 2, "field_label": "Transcript", "field_id": "transcript"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
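And if you already have transcripts, a stream generator along these lines could pre-populate the "transcript" field (the lookup dict is just a placeholder for wherever your transcripts come from):

EXISTING_TRANSCRIPTS = {"/path/to/file1.mp3": "An existing transcript..."}  # placeholder lookup

def get_stream(source):
    stream = AudioServer(source)
    for eg in stream:
        # pre-fill the text_input field so annotators only need to correct it
        eg["transcript"] = EXISTING_TRANSCRIPTS.get(eg["path"], "")
        yield eg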

Example 4: Alignment of text and audio

Highlight spans in the text that correspond to the audio, and vice versa. How you set this up depends on your requirements: you can load in already existing annotated regions in the text ("spans") or audio ("audio_spans"), or do both from scratch. For instance, if you have text with existing spans for disfluencies, you could ask the annotator to select the corresponding regions in the audio.

import prodigy
from prodigy.components.loaders import AudioServer
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy

TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript..."
}

@prodigy.recipe(
    "audio-alignment",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str), 
    label=("One or more comma-separated labels", "option", "l", split_string),
    lang=("Language for text tokenization", "option", "ln", str),
)
def audio_alignment(dataset: str, source: str, label: list = [], lang: str = "en"):
    def get_stream():
        stream = AudioServer(source)
        for eg in stream:
            # Get transcript for the audio file
            if eg["path"] in TRANSCRIPTS:
                eg["text"] = TRANSCRIPTS[eg["path"]]
                yield eg
    
    nlp = spacy.blank(lang)
    stream = get_stream()
    stream = add_tokens(nlp, stream)  # add tokens for manual text highlighting

    blocks = [
        {"view_id": "audio_manual", "labels": label}, 
        {"view_id": "ner_manual", "labels": label}
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

:warning: Known problems / open questions

  • If you go back and forth (submit and undo) very fast, it may cause the audio loading to fail and existing regions won't be drawn correctly because there's no audio loaded. It's usually solved by going back and forth again. I've been thinking about adding a "reload" button that completely reloads the current annotation card and audio, in case something goes wrong (but I haven't found a nice solution for this yet).
  • Resizing is a bit fiddly and you need to hit the exact region boundary. The cursor will then become a "resize" cursor (as opposed to the "move" cursor).
  • The loader currently only selects .wav and .mp3 files. Are there any other formats it should support?

Also: I need a cool test audio that we can use for the docs later on – ideally not too long, with different speakers, maybe disfluencies etc. And with a suitable license (public domain, CC etc.). Any ideas or suggestions? :slightly_smiling_face:


Hi @ines,

Would this audio annotator support live transcription? That is, live streaming of audio, and providing real-time transcription.

Thanks.

It could, but I'm not sure it'd be that useful? For annotation, especially if the goal is creating training data, you typically want the data collection process to be reliable and reproducible. There's way too much variance in live streams and it makes the whole process unpredictable. The focus of any audio workflows will definitely be creating training data for machine learning, not doing real-time transcription for other processes.

Thanks Ines! I think this currently works well for transcribing short utterances. For longer audio files (e.g. conversations), many people would prefer to transcribe and submit each segment/utterance/sentence one after another, which keeps the task streamlined instead of having to transcribe and submit the whole audio file at once. Another rather important feature is automatically adding timestamps for the transcribed utterances/phrases (this is what we actually need).

I think Prodigy would meet the requirements of many who prefer to have something that is more user friendly and not as complex as Praat (http://www.fon.hum.uva.nl/praat/).

Yes, I agree – I think smaller chunks are more efficient to work with. Cutting up the files is probably a better fit for the Python / recipe level? For instance, you could use a Python audio library to automatically cut the files (e.g. on X milliseconds or more of silence), log the start / end positions, save the resulting snippets out to individual files and then stream in that directory.
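For example, here's a rough sketch using pydub (just one possible audio library, nothing Prodigy ships or requires) to cut a file on silence, log the start/end positions and save out the snippets:

from pathlib import Path
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def cut_on_silence(path, out_dir, min_silence_len=700, silence_thresh=-40):
    # Detect non-silent chunks, save each one as its own file and keep the offsets
    audio = AudioSegment.from_file(path)
    ranges = detect_nonsilent(audio, min_silence_len=min_silence_len, silence_thresh=silence_thresh)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i, (start_ms, end_ms) in enumerate(ranges):
        out_file = out_dir / f"{Path(path).stem}_{i}.wav"
        audio[start_ms:end_ms].export(out_file, format="wav")
        print(f"{out_file}: {start_ms / 1000:.2f}s to {end_ms / 1000:.2f}s")

You could then point the audio-server loader at the output directory and stream in the snippets.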

Thanks for this new feature, this is great!
I'm currently experimenting with tagging portions of audio that contain last names. Sometimes these portions are very short and it's difficult to be accurate. It would be nice if we could zoom in on the waveform.

@Marie Thanks, that's nice to hear! I'll look into options for zooming, that's a good point.

Other features I'll be working on:

  • try and improve loading times for files by using a native audio player under the hood
  • add support for base64-encoded audio data (faster loading, allows storing data with the annotations). I had initially ruled this out because I thought audio data was too big, but turns out it's actually totally viable for compressed audio
  • assign different colours to regions based on the label (just like the bounding boxes in image_manual etc.) – at the moment, it only supports pre-defined "color" values on the audio spans
  • allow changing label of existing region (without having to re-add it)

We're also currently exploring options for different audio workflows for semi-automatic annotation and using active learning with a model in the loop, e.g. for speaker diarization. I'll hopefully be able to share more on this soon :slightly_smiling_face: