✨ Audio annotation UI (beta)

I posted a teaser of this on Twitter the other day :smiley: The prototype worked well enough that we decided to ship it as a "secret" beta feature in the latest v1.9.4. So if you're working with audio and want to test it, it's super easy now.

Requirements

  • Latest Prodigy v1.9.4 (released today, December 28)
  • A directory of audio files (.mp3 or .wav)

How it works

New interfaces: There are two new interfaces: audio (display an audio file with optional pre-defined, non-editable regions) and audio_manual (display an audio file and allow the user to draw, edit or remove regions for one or more labels).

New loader: Prodigy also ships a new loader, audio-server, which serves the audio files via the local Prodigy web server so they can be loaded in the app. Each task it creates (and that's later saved with the annotations in the database) also includes the original file path and file name. Instead of using the audio server loader, you can of course also load in a JSONL file where each task specifies a live URL for "audio" (just no local paths, since your browser will block that).

UI settings: You can toggle play/pause using the enter key. I've found that I often want to use the space bar (maybe because video editing tools do it this way, too?) so you might want to update the "keymap" in your prodigy.json or recipe config to remap the keys: "keymap": {"playpause": ["space"], "ignore": ["enter"]}

The audio UI supports the following settings: "show_audio_cursor": true (display a line at the position of the cursor), "show_audio_timeline": false (show a timeline) and "audio_minimap": true (show a minimap of the whole file).

Data format: Audio tasks need to contain an "audio" key and labelled regions are represented as "audio_spans" with a "start" and "end" (timestamp in seconds), a "label", an optional "color" and an "id".
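
For example, a single task in a JSONL file with one pre-annotated region could look like this (the URL, label and timestamps are made up for illustration):

{"audio": "https://example.com/sample.mp3", "audio_spans": [{"start": 2.5, "end": 7.25, "label": "SPEAKER_1", "id": "region-1"}]}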

Example 1: Manual audio annotation

Stream in audio files from a directory and annotate regions in them using the given labels. Use cases: diarization (speaker identification), selecting noise, disfluencies etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

Example 2: Binary audio annotation

Stream in audio files from a directory and collect binary annotations for a given (optional) label. Use cases: binary audio classification, data curation, etc.

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label GOOD_QUALITY --view-id audio

Example 3: Manual transcript

Load in audio files from a directory and ask the user to transcribe the audio manually. If you already have transcripts, you could also write a stream generator that pre-populates the "transcript" field for each audio task (so the annotators only need to correct it).

import prodigy
from prodigy.components.loaders import AudioServer

@prodigy.recipe(
    "audio-transcript",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str)
)
def audio_transcript(dataset: str, source: str):
    stream = AudioServer(source)
    blocks = [
        {"view_id": "audio"},
        {"view_id": "text_input", "field_rows": 2, "field_label": "Transcript", "field_id": "transcript"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
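
If you already have transcripts and want to pre-populate the "transcript" field as described above, a stream wrapper along these lines should do it – EXISTING_TRANSCRIPTS is just a hypothetical lookup here, and I'm assuming your keys match the "path" set by the loader:

EXISTING_TRANSCRIPTS = {"/path/to/file1.mp3": "An existing transcript..."}  # hypothetical lookup

def get_stream(source):
    for eg in AudioServer(source):
        # pre-fill the text_input field (field_id "transcript") so annotators only need to correct it
        eg["transcript"] = EXISTING_TRANSCRIPTS.get(eg["path"], "")
        yield eg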

Example 4: Alignment of text and audio

Highlight spans in the text that correspond to the audio, and vice versa. How you set this up depends on your requirements: you can load in already existing annotated regions in the text ("spans") or audio ("audio_spans"), or do both from scratch. For instance, if you have text with existing spans for disfluencies, you could ask the annotator to select the corresponding regions in the audio.

import prodigy
from prodigy.components.loaders import AudioServer
from prodigy.components.preprocess import add_tokens
from prodigy.util import split_string
import spacy

TRANSCRIPTS = {
    "/path/to/file1.mp3": "This is a transcript...",
    "/path/to/file2.mp3": "This is another transcript..."
}

@prodigy.recipe(
    "audio-alignment",
    dataset=("The dataset to save annotations to", "positional", None, str),
    source=("Directory of audio files", "positional", None, str), 
    label=("One or more comma-separated labels", "option", "l", split_string),
    lang=("Language for text tokenization", "option", "ln", str),
)
def audio_alignment(dataset: str, source: str, label: list = [], lang: str = "en"):
    def get_stream():
        stream = AudioServer(source)
        for eg in stream:
            # Get transcript for the audio file
            if eg["path"] in TRANSCRIPTS:
                eg["text"] = TRANSCRIPTS[eg["path"]]
                yield eg
    
    nlp = spacy.blank(lang)
    stream = get_stream()
    stream = add_tokens(nlp, stream)  # add tokens for manual text highlighting

    blocks = [
        {"view_id": "audio_manual", "labels": label}, 
        {"view_id": "ner_manual", "labels": label}
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }

:warning: Known problems / open questions

  • If you go back and forth (submit and undo) very fast, it may cause the audio loading to fail and existing regions won't be drawn correctly because there's no audio loaded. It's usually solved by going back and forth again. I've been thinking about adding a "reload" button that completely reloads the current annotation card and audio, in case something goes wrong (but I haven't found a nice solution for this yet).
  • Resizing is a bit fiddly and you need to hit the exact region boundary. The cursor will then become a "resize" cursor (as opposed to the "move" cursor).
  • The loader currently only selects .wav and .mp3 files. Are there any other formats it should support?

Also: I need a cool test audio that we can use for the docs later on – ideally not too long, with different speakers, maybe disfluencies etc. And with a suitable license (public domain, CC etc.). Any ideas or suggestions? :slightly_smiling_face:


Hi @ines,

Would this audio annotator support live transcription? That is, live streaming of audio, and providing real-time transcription.

Thanks.

It could, but I'm not sure it'd be that useful? For annotation, especially if the goal is creating training data, you typically want the data collection process to be reliable and reproducible. There's way too much variance in live streams and it makes the whole process unpredictable. The focus of any audio workflows will definitely be creating training data for machine learning, not doing real-time transcription for other processes.

Thanks Ines! I think that, as it currently stands, this works for transcribing short utterances. For longer audio files (e.g. conversations), many would like to transcribe and submit each segment/utterance/sentence one after another, which makes the whole task more streamlined than having to transcribe and submit the whole audio file at once. Another rather important feature is automatically adding the timestamps for the transcribed utterances/phrases (this is what we actually need).

I think Prodigy would meet the requirements of many who prefer to have something that is more user friendly and not as complex as Praat (http://www.fon.hum.uva.nl/praat/).

Yes, I agree – I think smaller chunks are more efficient to work with. Cutting up the files is probably a better fit for the Python / recipe level? For instance, you could use a Python audio library to automatically cut the files (e.g. on X milliseconds or more of silence), log the start / end positions, save the resulting snippets out to individual files and then stream in that directory.
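
Just to sketch what I mean – this isn't a built-in workflow, and the pydub settings below (silence length and threshold) are placeholder values you'd want to tune for your data:

from pathlib import Path
from pydub import AudioSegment
from pydub.silence import detect_nonsilent

def cut_on_silence(in_path: str, out_dir: str):
    audio = AudioSegment.from_file(in_path)
    # treat 500ms or more below -40 dBFS as silence and keep the non-silent spans
    spans = detect_nonsilent(audio, min_silence_len=500, silence_thresh=-40)
    for i, (start_ms, end_ms) in enumerate(spans):
        # keep the original offsets in the file name so they stay associated with the snippet
        out_path = Path(out_dir) / f"{Path(in_path).stem}_{i}_{start_ms}_{end_ms}.wav"
        audio[start_ms:end_ms].export(out_path, format="wav")

You could then point the audio-server loader at the output directory.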

Thanks for this new feature, this is great!
I'm currently experimenting with tagging portions of audio which contain last names. Sometimes these portions are very short and it's difficult to be accurate. It would be nice if we could zoom in on the wave.

@Marie Thanks, that's nice to hear! I'll look into options for zooming, that's a good point.

Other features I'll be working on:

  • try and improve loading times for files by using a native audio player under the hood
  • add support for base64-encoded audio data (faster loading, allows storing data with the annotations). I had initially ruled this out because I thought audio data was too big, but turns out it's actually totally viable for compressed audio
  • assign different colours to regions based on the label (just like the bounding boxes in image_manual etc.) – at the moment, it only supports pre-defined "color" values on the audio spans
  • allow changing label of existing region (without having to re-add it)

We're also currently exploring options for different audio workflows for semi-automatic annotation and using active learning with a model in the loop, e.g. for speaker diarization. I'll hopefully be able to share more on this soon :slightly_smiling_face:

Thanks for the update and this cool feature Ines. I'm just dropping a note to let you know my company and I will be trying this feature out over the coming month. I'll have to post back with details on how that goes. Our specific possible implementation has to do with Ag-Tech and saving piglet lives.


Thanks @ines, this is a great feature to include in Prodigy, thank you.

In version 1.9.9 though there seems to be a bug. If I run your Manual audio annotation example as indicated:

prodigy mark audio_dataset "/path/to/audios" --loader audio-server --label LABEL1,LABEL2 --view-id audio_manual

The first audio file loads properly.

But as I click the green (accept) button to go to the next example, the audio wave disappears, so I lose the ability to do the manual audio annotation.

Only if I reload the browser does it work properly again, and then only for the first audio file loaded.

Any pointers?

Changing the batch_size in prodigy.json to 1 resolves the issue I mention above.

@cesarandreslopez Thanks for testing the interface and the feedback. It sounds like this is related to the loading process crashing somewhere in between :thinking: The next version of Prodigy will also provide an Audio loader that converts the audio data to base64, which should solve the loading issues – and it'll also include an option to drop that data from the examples before they're saved in the database (to prevent bloat).

(If your files aren't too big, you can already try and stream in base64-encoded strings as the "audio" value. It should already work out-of-the-box – you'll just end up with the data in your database, which may take up quite a bit of space.)
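
A stream generator along these lines should work – I'm assuming MP3 files and a data URI here, so you may need to adjust the MIME type (or drop the prefix) depending on your files:

import base64
from pathlib import Path

def stream_base64(source_dir: str):
    for path in sorted(Path(source_dir).glob("*.mp3")):
        with path.open("rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        # the encoded audio is saved to the database together with the annotations
        yield {"audio": f"data:audio/mpeg;base64,{encoded}", "meta": {"file": path.name}}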

Hi @ines, this is a great feature for Prodigy that matches my project. I would like to know if it supports stereo audio, and how it works in that case. I mean, can I label audio regions in each channel and also align text to the corresponding audio region in each channel?
Thanks!

That's an interesting use case. So how would this look in practice? Would you want to see both channels separately and also be able to annotate them separately, or at least, align the regions differently?

Displaying split channels is definitely no problem, but I'm not sure how easy it'd be to allow per-channel region annotations.

Thanks for your answer.

Yes, in the case of a stereo call, if we could see both channels separately and annotate them separately, it would allow us to associate the label with both the audio and the channel.

For example, this could be useful to label sentiment for each speaker, or to label incorrect words and provide their correct transcription, indicating the channel that contains that information.

Ah, so you'd have different content on the different channels? At the moment, you'd have to split the channels and stream them in as separate tasks in order. You can still add custom meta information to the tasks that tell you which task is related to which file and channel.
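
For example, a small preprocessing step could split each file into mono channels and record the channel per output file – here with pydub as one option (the file naming is just an example):

from pathlib import Path
from pydub import AudioSegment

def split_channels(in_path: str, out_dir: str):
    # split_to_mono() returns one AudioSegment per channel (left and right for stereo)
    for i, channel in enumerate(AudioSegment.from_file(in_path).split_to_mono()):
        out_path = Path(out_dir) / f"{Path(in_path).stem}_channel{i}.wav"
        channel.export(out_path, format="wav")

You could then stream in the output directory and add the channel index to each task's "meta".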


Hi @ines ,
First of all, thanks for this feature – it's the reason my company bought a Prodigy license, as we work with audio.

I wanted to communicate feedback from our use case in case it's useful and in the hopes that we might get the features we need :slight_smile:

We use the audio interface as part of a custom recipe for human-in-the-loop speech segmentation. Our most sorely needed feature is to display an accompanying spectrogram. Along with that comes the need to increase the size of the component to take up 70-80% of the screen space so we can see the spectrogram in more detail. Ironically, the fastest and most accurate way to manually "search" speech data is not to listen to it but to "read" the spectrogram, and its presence would easily cut down our annotation time 4-5x.

I've found that wavesurfer.js has a spectrogram plugin, which I guess would be the obvious starting point for adding this feature. I've tried to figure out if I can inject this through custom CSS or JS, but I couldn't work it out on short notice. Do you think there is a simple way I can do this on our side?

The fiddly handles are, as you mentioned, a major problem. For now, every time we load the interface, we go to the browser's Inspect tool and set the handle width to 5px with:

.c0177 .wavesurfer-handle {
  width: 5px !important;
  background-color: currentColor !important;
}

There's probably a bookmarklet that can be hacked together for this, as a user-side solution.
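
If I'm not mistaken, there's also a "global_css" setting in prodigy.json that could apply this without the manual step – something like the following (untested on our end, and you'd probably want a selector that doesn't depend on the generated class name):

{
  "global_css": ".wavesurfer-handle { width: 5px !important; background-color: currentColor !important; }"
}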

Also, it would be great if there were a way to have the audio display not follow playback (at higher zoom levels). Sometimes you want to keep your view fixed, but listen to the audio without worrying about the cursor getting away from you and shifting your view.

And below are the rest of our laundry list of feature requests :slight_smile:

  • Adjust active selection by keyboard - move whole selection left/right, extend/shrink only left/right
    • We don't do multiple segments, but I imagine for people who do, a keyboard-bindable "cycle active segment" would be useful, in conjunction with the above feature.
  • Undo stack for the segment edits
  • Jump cursor forward/backward some fixed amount (e.g. 15 sec)
  • Option to disable the time display that follows the mouse cursor - or at least have it not appear in the middle. It's covering the waveform display right where I'm trying to look!

Out of these, the keyboard bindings for adjusting segments are probably the most important.

I'm not sure how much demand you've seen for this component or how actively it's being developed, but it would be really useful for us if it were more mature.

Thank you!

That's an interesting request. I found this user post that might be relevant:

But it's not something we support natively just yet. I will add a ticket on our end to explore it. No promises on a date, but I agree it's something we can explore. I can totally see how a spectrogram might be a more convenient UI.

Just to get some feedback, would it be appropriate if the spectrogram would replace the current selection view?


Thank you for your reply and the ticket @koaning !

Just to get some feedback, would it be appropriate if the spectrogram would replace the current selection view?

If we had to choose between the two, I'd take the spectrogram, no question. That said, having the waveform as an additional display is helpful for judging the energy (i.e. volume level) of the audio at that time instant, which is not that obvious in the spectrogram. This sometimes helps with the annotation, but is not super critical.


I have been digging into the spectrogram idea (within this forum) lately. I think having it as a native plugin would be excellent.

That would be quicker for us users, as I have difficulties understanding exactly how one adds the spectrogram view via the Wavesurfer plugin (where to write which code). In the linked post, the user talks about adding custom JS to the recipe, but I'd love more detailed instructions on which parts to modify (within recipe.py and in other files, too).

---Edit---
After some more digging, I found this post, detailing the how-to: https://support.prodi.gy/t/custom-view-templates-with-scripts/302/19. I am leaving it here so others can include the spectrograms more easily.


Hi, I too would very much like that feature – is something like this available? I tried customizing pyannote's own recipes from GitHub, but it got a little hard: it seems possible, but there's way too much JavaScript code, and the modules seem kind of outdated (around two years old). Would Prodigy have some example without two audio sources, with the second audio source included just in the meta? Both audio sources would be the same file, but different channels.

I would also like an option to mute the other channel, but I can do this with window.wavesurfer – I just have to make sure it's available.

EDIT:
Partial answer to my own question available at: