I posted a teaser of this on Twitter the other day. The prototype worked well enough that we decided to ship it as a "secret" beta feature in the latest v1.9.4. So if you're working with audio and want to test it, it's super easy now.
Requirements
- Latest Prodigy v1.9.4 (released today, December 28)
- A directory of audio files (.mp3 or .wav)
How it works
New interfaces: There are two new interfaces: audio (display an audio file with optional pre-defined, non-editable regions) and audio_manual (display an audio file and allow the user to draw, edit or remove regions for one or more labels).
New loader: Prodigy also ships a new loader, audio-server, which serves the audio files via the local Prodigy web server so they can be loaded in the app. Each task it creates (and that's later saved with the annotations in the database) also includes the original file path and file name. Instead of using the audio-server loader, you can of course also load in a JSONL file where each task specifies a live URL for "audio" (just no local paths, since your browser will block those).
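For instance, a JSONL source with live URLs could look something like this (the URLs are just placeholders, one task per line):

```json
{"audio": "https://example.com/interview_part1.mp3"}
{"audio": "https://example.com/interview_part2.mp3"}
```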
UI settings: You can toggle play/pause using the enter key. I've found that I often want to use the space bar (maybe because video editing tools do it this way, too?), so you might want to update the "keymap" in your prodigy.json or recipe config to remap the keys: "keymap": {"playpause": ["space"], "ignore": ["enter"]}
The audio UI supports the following settings: "show_audio_cursor": true (display a line at the position of the cursor), "show_audio_timeline": false (show a timeline) and "audio_minimap": true (show a minimap of the whole file).
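Putting those together, a prodigy.json overriding the keymap and the audio settings might look like this (the values shown are the ones described above):

```json
{
  "keymap": {"playpause": ["space"], "ignore": ["enter"]},
  "show_audio_cursor": true,
  "show_audio_timeline": false,
  "audio_minimap": true
}
```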
Data format: Audio tasks need to contain an "audio" key, and labelled regions are represented as "audio_spans" with a "start" and "end" (timestamps in seconds), a "label", an optional "color" and an "id".
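To illustrate, a single task in this format could look like this (the timestamps, label, color and IDs are made up for the example):

```json
{
  "audio": "https://example.com/recording.mp3",
  "audio_spans": [
    {"start": 2.5, "end": 7.25, "label": "SPEAKER_1", "color": "pink", "id": "region-1"}
  ]
}
```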
Example 1: Manual audio annotation
Stream in audio files from a directory and annotate regions in them using the given labels. Use cases: diarization (speaker identification), selecting noise, disfluencies etc.
prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual
Example 2: Binary audio annotation
Stream in audio files from a directory and collect binary annotations for a given (optional) label. Use cases: binary audio classification, data curation, etc.
prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label GOOD_QUALITY --view-id audio
Example 3: Manual transcript
Load in audio files from a directory and ask the user to transcribe the audio manually. If you already have transcripts, you could also write a stream generator that pre-populates the "transcript" field for each audio task (so the annotators only need to correct it).
import prodigy
from prodigy.components.loaders import AudioServer

@prodigy.recipe(
    "audio-transcript",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str),
)
def audio_transcript(dataset: str, source: str):
    stream = AudioServer(source)
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_rows": 2, "field_label": "Transcript", "field_id": "transcript"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
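The pre-populating generator mentioned above could be sketched like this. Note that add_transcripts is a made-up helper name, not part of Prodigy's API; it only relies on the "path" key that the audio-server loader adds to each task, so it's shown here with a plain list standing in for the loader:

```python
# Sketch: pre-populate the "transcript" field from existing transcripts,
# keyed by the original file path. add_transcripts is a hypothetical helper,
# not part of Prodigy's API.
TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript...",
}

def add_transcripts(stream, transcripts):
    """Copy an existing transcript into each task, keyed by file path."""
    for eg in stream:
        # Fall back to an empty string so annotators can transcribe from scratch
        eg["transcript"] = transcripts.get(eg.get("path"), "")
        yield eg

# Minimal fake stream standing in for AudioServer(source)
tasks = [
    {"audio": "http://localhost:8080/file1.mp3", "path": "/path/to/file1.mp3"},
    {"audio": "http://localhost:8080/file3.mp3", "path": "/path/to/file3.mp3"},
]
out = list(add_transcripts(tasks, TRANSCRIPTS))
print(out[0]["transcript"])  # This is a transcript...
print(out[1]["transcript"])  # "" (no transcript on file)
```

In the recipe, you'd then wrap the loader, e.g. stream = add_transcripts(AudioServer(source), TRANSCRIPTS), before returning it.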
Example 4: Alignment of text and audio
Highlight spans in the text that correspond to the audio, and vice versa. How you set this up depends on your requirements: you can load in already existing annotated regions in the text ("spans") or audio ("audio_spans"), or do both from scratch. For instance, if you have text with existing spans for disfluencies, you could ask the annotator to select the corresponding regions in the audio.
import prodigy
import spacy
from prodigy.components.loaders import AudioServer
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string

TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript...",
}

@prodigy.recipe(
    "audio-alignment",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str),
    label=("One or more comma-separated labels", "option", "l", split_string),
    lang=("Language for text tokenization", "option", "ln", str),
)
def audio_alignment(dataset: str, source: str, label: list = [], lang: str = "en"):
    def get_stream():
        stream = AudioServer(source)
        for eg in stream:
            # Get transcript for the audio file and only send out tasks that
            # have one, since the text block needs a "text" key
            if eg["path"] in TRANSCRIPTS:
                eg["text"] = TRANSCRIPTS[eg["path"]]
                yield eg

    nlp = spacy.blank(lang)
    stream = get_stream()
    stream = add_tokens(nlp, stream)  # add tokens for manual text highlighting
    blocks = [
        {"view_id": "audio_manual", "labels": label},
        {"view_id": "ner_manual", "labels": label},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
Known problems / open questions
- If you go back and forth (submit and undo) very fast, the audio may fail to load, and existing regions won't be drawn correctly because there's no audio to draw them on. Going back and forth again usually fixes it. I've been thinking about adding a "reload" button that completely reloads the current annotation card and audio in case something goes wrong (but I haven't found a nice solution for this yet).
- Resizing is a bit fiddly and you need to hit the exact region boundary. The cursor will then become a "resize" cursor (as opposed to the "move" cursor).
- The loader currently only selects .wav and .mp3 files. Are there any other formats it should support?
Also: I need a cool test audio that we can use for the docs later on – ideally not too long, with different speakers, maybe disfluencies etc. And with a suitable license (public domain, CC etc.). Any ideas or suggestions?