Multi-stage speaker audio classification with `pyannote.sad.manual` and `audio.manual`

I would like to do multi-class audio annotation using the audio.manual recipe. However, I would like to first establish speech and non-speech regions using the pyannote.sad.manual model-in-the-loop annotation.

I found the above thread in the forum suggesting that I could potentially get labels into the database using pyannote.sad.manual, and then go in and update those initial SPEECH zone labels with the specific classes of interest (e.g., MALE, FEMALE, CHILD, etc.). Is there documentation or guidance available as to how to do that?

Hi! The specifics depend on the exact label scheme and what's most efficient. One option would be to create a custom recipe using a custom interface with two blocks: an audio block and a choice block with the different category options. For each of your existing annotations, you can create a new example (one per SPEECH region) and add the options to it, so you can ask about each region specifically.

The stream generating the data could look like this:

from prodigy.components.db import connect
import copy

# The options to choose from
options = [{"id": "ADULT", "text": "Adult"}, {"id": "CHILD", "text": "Child"}]

def get_stream():
    db = connect()
    # Load your already annotated data
    examples = db.get_dataset("your_existing_speech_dataset")
    for eg in examples:
        # These are the annotated regions in the example 
        audio_spans = eg.get("audio_spans", [])
        for span in audio_spans:
            # Create a new example for each annotated span, so you
            # can select one category per span – make sure to deepcopy!
            new_eg = copy.deepcopy(eg)
            new_eg["audio_spans"] = [span]
            new_eg["options"] = options
            yield new_eg

Your blocks config would then be:

"blocks": [{"view_id": "audio"}, {"view_id": "choice"}]

The only tricky thing in this case is that it's audio data, which can get quite large. To send the audio data to the server, Prodigy typically encodes it to base64 and then removes the base64 string before the data is saved to the database (keeping only a reference to the original file path). So when you re-annotate the data, you'll need to fetch the audio back from that path. Prodigy comes with a fetch_media helper that does exactly that:

from prodigy.components.preprocess import fetch_media

stream = get_stream()  # your custom stream, see above
stream = fetch_media(stream, ["audio"])  # replace all "audio" keys with base64

(This also means you probably want to do your own chunking if you want to be able to re-annotate examples without having to split the original file on the timestamps again.)
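For instance, if you cut the source files into fixed-length clips up front and save each clip as its own file, every later example keeps pointing at a real file on disk. A minimal sketch, assuming pydub is installed (the chunk length, directory layout and function name are just illustrative, not part of any Prodigy recipe):

from pathlib import Path

from pydub import AudioSegment  # assumption: pydub is used here for splitting

CHUNK_MS = 10_000  # e.g. 10-second chunks

def chunk_audio_files(in_dir: str, out_dir: str):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav_path in Path(in_dir).glob("*.wav"):
        audio = AudioSegment.from_wav(str(wav_path))
        for i, start in enumerate(range(0, len(audio), CHUNK_MS)):
            # Save each chunk as its own file so later recipes can re-load
            # it from its "path" instead of re-splitting the original.
            chunk = audio[start:start + CHUNK_MS]
            chunk.export(str(out / f"{wav_path.stem}_{i:03d}.wav"), format="wav")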


Here's my attempt at building out the above recipe. Unfortunately, I'm getting "No tasks available" whenever I fire it up.

import copy
from typing import List

import prodigy
from prodigy.components.db import connect
from prodigy.components.preprocess import fetch_media
from prodigy.util import get_labels

options = [
    {'id': 'FEM', 'text': 'Female'},
    {'id': 'MAL', 'text': 'Male'},
    {'id': 'CHI(1)', 'text': 'Child (single)'},
    {'id': 'CHI(2p)', 'text': 'Children (plural)'},
    ]

@prodigy.recipe(
    "post-sad-multiclass",
    dataset=("The dataset to use", "positional", None, str),
    source=("Source dir containing audio files", "positional", None, str),
    label=("Comma-separated label(s)", "option", "l", get_labels),
    # silent=("Don't output anything", "flag", "S", bool)
    )
def multiclass_audio(
        dataset: str,
        label: List[str],
        source: str):
    """ ...put in a useful docstring here at some point...
    """

    def get_stream():
        # Load the directory of audio files and add options to each task
        db = connect()
        # Load your already annotated data
        examples = db.get_dataset(dataset)
        for eg in examples:
            # These are the annotated regions in the example
            audio_spans = eg.get("audio_spans", [])
            for span in audio_spans:
                # Create a new example for each annotated span, so you
                # can select one category per span - make sure to deepcopy!
                new_eg = copy.deepcopy(eg)
                new_eg["audio_spans"] = [span]
                new_eg["options"] = options
                yield new_eg

    with open('/path/to/my/custom/html/multiclass-audio-template.html', 'r') as f:
        html_template = f.read()

    blocks = [
        {'view_id': 'html', 'html_template': html_template}, # Because I want to be able to speed up / slow down the playback speed...
        {'view_id': 'audio'},
        {'view_id': 'choice'}
    ]

    stream = get_stream()  # your custom stream, see above
    stream = fetch_media(stream, ["audio"])  # replace all "audio" keys with base64

    return {
        'dataset': dataset,
        'stream': stream,
        'view_id': 'blocks',
        'config': {
            'blocks': blocks,
            'audio_autoplay': False,
            'audio_bar_gap': 1,
            'audio_bar_height': 1,
            'audio_bar_radius': 1,
            'audio_bar_width': 2,
            'audio_loop': False,
            'audio_max_zoom': 5000,
            'audio_rate': 1.0,
            'show_audio_cursor': True,
            'show_audio_cursor_time': True,
            'show_audio_minimap': True,
            'show_audio_timeline': True,
            'force_stream_order': True,
            'labels': ['FEM', 'MAL', 'CHI(1)', 'CHI(2p)'],
            'custom_theme': {
                'labels': {
                    'FEM': '#84E9F3',
                    'MAL': '#4E6BF3',
                    'CHI(1)': '#F2B11A',
                    'CHI(2p)': '#852215',
                }
            }
        }
    }

I'm invoking the above like so:

prodigy post-sad-multiclass engage_ny ./data/training_inputs/audio -F ../prodigy_custom/post-sad-multiclass.py

after having previously invoked the pyannote.sad.manual task like so:

prodigy pyannote.sad.manual engage_ny ../data/audio_files/wav

So, what I think is happening:

  1. I run pyannote.sad.manual to get an initial cut at speech regions.
  2. When I click through and accept (or modify) those speech regions, they are being stored in memory
  3. When I hit the save button, they get written to a database called, in my case, engage_ny
  4. When I finish up, I should be able to load in those annotations from the engage_ny database by having 'engage_ny' be the parameter I pass in to the db.get_dataset() call
  5. Those records should now be available to me to re-annotate with multi-label classifications rather than simply speech/no speech.
  6. When I save the file again, I will now have both the original annotation (speech/no speech) and the new annotation (multi-label) associated with a given segment of my audio.

Is all of that correct?

Two nuances I wanted to flag, and I wasn't sure how they would play out:

  1. I would prefer to set this up with an audio_manual interface rather than just audio, because a given SPEECH region can have overlapping voices, and I need to articulate the boundaries of the overlap. I think the original suggestion you provided gives me a forced-choice single-label-per-speech-region setup, correct?
  2. I wasn't sure I understood the setup properly. Will this recipe ONLY show me the regions of my audio which pyannote.sad.manual had identified as speech, filtering out everything else? Or will I get the same chunks / snippet durations? (I wasn't quite sure what the audio_spans object was going to wind up containing...whether it would contain the original chunks of audio, or only the regions that had been tagged as SPEECH.)

I can verify that after enabling PRODIGY_LOGGING=verbose and forcing a refresh in the browser to try to queue up new tasks, I wind up getting the same error as noted in this thread: No tasks available with custom image recipe


These are the upstream outputs of the verbose logging:

Hi! I think the problem here is that you're using the same dataset for both annotation runs and loading the same examples back in, with the same hashes. As a result, Prodigy will skip all examples that are already annotated and present in the data (so you're not asked the same question twice) and your resulting stream will be empty.

From what you describe, you probably want to be using a different dataset name so you have a cleaner separation between the different types of annotations you collect? If you ever do want to add modified examples to the same set, you can use the set_hashes helper to re-hash the incoming examples, or assign your own hashes manually so Prodigy knows they're in fact different questions.
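For example, re-hashing in the stream could look roughly like this (a minimal sketch, assuming the set_hashes keyword arguments shown here; which task keys you hash on is up to you):

from prodigy import set_hashes

def rehash(stream):
    for eg in stream:
        # Re-hash each modified example so Prodigy treats it as a new question.
        # overwrite=True replaces the hashes carried over from the first run;
        # hashing on "audio_spans" and "options" makes each per-span question
        # hash differently from the original full-region example.
        yield set_hashes(eg, task_keys=("audio_spans", "options"), overwrite=True)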

The audio_spans will contain the regions highlighted in the UI, so your recipe will create one new question per speech region.
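For reference, each entry in audio_spans is just a dict with the region's timestamps (in seconds) and its label, roughly like this (the UI may add extra keys such as an internal id):

{"start": 2.25, "end": 5.91, "label": "SPEECH"}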

Thanks as ever for the quick responses!

Ahh, I thought I needed to pass the same dataset name to prodigy on the command line as the one I was hooking up to in the DB. If I'm hearing you right, what I'd actually want is for the CLI to get my target ("after this stage of the annotation pipeline") name - e.g., engage_ny_multi - whereas I'd pass the name of the source dataset to db.get_dataset() - e.g., engage_ny.

I think you're saying it's not that I want to work within the same dataset for both, with the new annotations just being an extra set of columns for the existing hashes - which is what I thought I was doing - but rather that it's actually a second dataset entirely.

Is that correct?

So something like

Invoked like

prodigy post-sad-multiclass engage_ny -T engage_ny_multi -F ../prodigy_custom/post-sad-multiclass.py

...or something?

I guess this is what I was concerned about - the speech regions as demarcated by pyannote.sad.manual aren't perfect, obviously, so there's some correction to be done.

If I'm understanding you properly - what I'd be getting is an audio segment whose boundaries are set by the timestamps of the entry in the database. I will be able to label that segment with my new class, for instance, but I wouldn't necessarily be able to expand its boundaries or contract it. Is that correct?

:rofl: :sweat_smile: SO CLOSE!!

...and yet so far. So I think what I'm seeing is that my issue is right here:

Rather than pulling out the spans - which just have the timestamps - I need to figure out how to stream out the audio itself again. I think I should be able to reverse-engineer the pyannote.sad.manual recipe to get there...

You can think of datasets as "collections" of annotations – so you'd typically want to collect the data from different experiments in different datasets. This will also prevent Prodigy from skipping examples that are already in the set – which is typically a very useful mechanism, but not if you're re-annotating existing examples.

Yes, you'll see the spans that already exist in the data, and if you're using a manual interface, you'll be able to edit them.

Yes, I think this is what I was trying to explain in my comment in the first post:

Thanks for the guidance. I attempted to follow your suggestion below...

...using the following recipe:


@prodigy.recipe(
    "post-sad-multiclass",
    dataset=("The dataset to read from", "option", "d", str),
    target=("The dataset to save result in", "option", "t", str),
    source=("Source dir containing audio files", "option", "s", str),
    label=("Comma-separated label(s)", "option", "l", get_labels),
    # quiet=("Don't output anything", "flag", "q", bool)
    )
def multiclass_audio(
        dataset: str,
        target: str,
        label: List[str],
        source: str):
    """ """

    def get_stream():
        # Load the directory of audio files and add options to each task
        prodigy.log('Instantiating DB connection')
        db = connect()
        # Load your already annotated data
        prodigy.log(f'Connecting to DB {dataset}')
        records = db.get_dataset(dataset)
        prodigy.log(f'Labels to apply: {[o for o in options]}')
        for rec in records:
            # These are the annotated regions in the example
            audio_spans = rec.get("audio_spans", [])
            for span in audio_spans:
                # Create a new example for each annotated span, so you
                # can select one category per span - make sure to deepcopy!
                new_rec = copy.deepcopy(rec)
                new_rec["audio_spans"] = [span]
                new_rec["options"] = options
                yield new_rec

    with open('/Users/tsslade/Dropbox/BerkeleyMIDS/projects/w210_capstone/prodigy_custom/multiclass-audio-template.html', 'r') as f:
        html_template = f.read()

    prodigy.log('Defining blocks')
    blocks = [
        {'view_id': 'html', 'html_template': html_template},
        {'view_id': 'audio_manual'},
        # {'view_id': 'choice'}
    ]

    prodigy.log('Instantiating stream')
    stream = get_stream()  # your custom stream, see above
    stream = fetch_media(stream, ["audio"])  # replace all "audio" keys with base64


    return {
        'dataset': target,
        'stream': stream,
        'view_id': 'blocks',
        'config': {
            'blocks': blocks,
            'audio_autoplay': False,
            'audio_bar_gap': 0,
            'audio_bar_height': 2,
            'audio_bar_radius': 1,
            'audio_bar_width': 1,
            'audio_loop': False,
            'audio_max_zoom': 5000,
            'audio_rate': 1.0,
            'show_audio_cursor': True,
            'show_audio_cursor_time': True,
            'show_audio_minimap': True,
            'show_audio_timeline': True,
            'force_stream_order': True,
            'labels': ['FEM', 'MAL', 'CHI(1)', 'CHI(2p)'],
            'custom_theme': {
                'labels': {
                    'FEM': '#84E9F3',
                    'MAL': '#4E6BF3',
                    'CHI(1)': '#F2B11A',
                    'CHI(2p)': '#852215',
                }
            }
        }
    }

...I wind up with a blank space where the waveform ought to be, as I posted in my earlier comment.

Is this something to do with the way the new_eg["audio_spans"] gets handled? Since I'm using pyannote.sad.manual with a chunk size of 10 as my first-line preprocessing, is it possible that fetch_media(stream, ["audio"]) is confused because it wants to look for a subset of an audio file as a distinct entity, but it's only finding the full audio files without the pyannote chunking already applied?

(I wasn't sure if that's what you were getting at in the "do your own chunking" statement, or something else.)

Yes, this is what I meant by "do your own chunking" – sorry if this was confusing! The problem is that if you let some other process (e.g. the pyannote recipe) chunk up the audio and then remove the base64-encoded data from the annotations, you'll lose that data. So fetching the media can't help you here either, because the chunks never existed as files – you just have the timestamps and potentially a reference to the original file.

Storing the whole audio data with your annotations is likely impractical, so the most logical solution would be to create the chunks yourself and store them as separate files, so you don't lose the reference to the data. (The alternative would be to load the original file back in, split it at the given timestamps, encode the result to base64 and then send that out – but that's going to be a lot more work.)
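If you do go that second route, a rough sketch of what it could look like (assuming pydub, that each example carries a "path" to the full file, and that the span timestamps are in seconds relative to that file – if the first pass was chunked, you'd also have to add the chunk offset):

import base64
import io

from pydub import AudioSegment  # assumption: pydub is used for slicing

def add_audio_for_spans(stream):
    for eg in stream:
        audio = AudioSegment.from_file(eg["path"])
        span = eg["audio_spans"][0]  # one span per example in this setup
        clip = audio[int(span["start"] * 1000):int(span["end"] * 1000)]
        buf = io.BytesIO()
        clip.export(buf, format="wav")
        buf.seek(0)
        # Attach the sliced region as a base64 data URI, which is what the
        # audio UI expects under the "audio" key.
        encoded = base64.b64encode(buf.read()).decode("utf-8")
        eg["audio"] = f"data:audio/wav;base64,{encoded}"
        yield eg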

So I took your advice and went ahead and did the chunking myself - 30-second chunks in pyannote.sad.manual, where 30 seconds is the full length of the input audio file. I'm still running into issues. I've uploaded the verbose log from this; the recipe is as follows.

import copy
import os
from pathlib import Path
from typing import List

import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import Audio
from prodigy.components.preprocess import fetch_media
from prodigy.util import get_labels, log, msg

# HERE = os.getcwd()
# prodigy.log(f'Here: {HERE}')
# AUDIO_FOLDER = '../data/audio_files/wav/'
# prodigy.log(f'Audio folder: {AUDIO_FOLDER}')

def remove_base64(examples):
    """Remove base64-encoded string if "path" is preserved in example."""
    for eg in examples:
        if "audio" in eg and eg["audio"].startswith("data:") and "path" in eg:
            eg["audio"] = eg["path"]
        if "video" in eg and eg["video"].startswith("data:") and "path" in eg:
            eg["video"] = eg["path"]
    return examples


options = [
    {'id': 'FEM', 'text': 'Female'},
    {'id': 'MAL', 'text': 'Male'},
    {'id': 'CHI(1)', 'text': 'Child (single)'},
    {'id': 'CHI(2p)', 'text': 'Children (plural)'},
]

@prodigy.recipe(
    "post-sad-multiclass",
    dataset=("The dataset to read from", "option", "d", str),
    target=("The dataset to save result in", "option", "t", str),
    source=("Source dir containing audio files", "option", "s", str),
    label=("Comma-separated label(s)", "option", "l", get_labels),
    # quiet=("Don't output anything", "flag", "q", bool)
    )
def multiclass_audio(
        dataset: str,
        target: str,
        label: List[str],
        source: str):
    """
    """

    def get_stream():
        # Load the directory of audio files and add options to each task
        prodigy.log('Instantiating DB connection')
        db = connect()
        # Load your already annotated data
        prodigy.log(f'Connecting to DB {dataset}')
        examples = db.get_dataset(dataset)
        # prodigy.log(f'Labels to apply: {[o for o in options]}')
        for eg in examples:
            audio_spans = eg.get("audio_spans", [])
            for span in audio_spans:
                # Create a new example for each annotated span, so you
                # can select one category per span - make sure to deepcopy!
                new_eg = copy.deepcopy(eg)
                new_eg["audio_spans"] = [span]
                new_eg["options"] = options
                yield new_eg

    with open('C:/Users/tslade/projects/teacherprints/prodigy/multiclass-audio-template.html', 'r') as f:
        html_template = f.read()

    with open('C:/Users/tslade/projects/teacherprints/prodigy/timestretcher.js', 'r') as f:
        javascript = f.read()


    prodigy.log('Defining blocks')
    blocks = [
        {'view_id': 'html', 'html_template': html_template},
        {'view_id': 'audio_manual'},
    ]

    prodigy.log('Instantiating stream')
    stream = get_stream()  # your custom stream, see above
    stream = fetch_media(stream, ["audio"])  # replace all "audio" keys with base64

    return {
        'dataset': target,
        'stream': stream,
        'view_id': 'blocks',
        'config': {
            'blocks': blocks,
            'javascript': javascript,
            'audio_autoplay': False,
            'audio_bar_gap': 0,
            'audio_bar_height': 2,
            'audio_bar_radius': 1,
            'audio_bar_width': 1,
            'audio_loop': False,
            'audio_max_zoom': 5000,
            'audio_rate': 1.0,
            'show_audio_cursor': True,
            'show_audio_cursor_time': True,
            'show_audio_minimap': True,
            'show_audio_timeline': True,
            'force_stream_order': True,
            'labels': ['FEM', 'MAL', 'CHI(1)', 'CHI(2p)'],
            'custom_theme': {
                'labels': {
                    'FEM': '#84E9F3',
                    'MAL': '#4E6BF3',
                    'CHI(1)': '#F2B11A',
                    'CHI(2p)': '#852215',
                }
            }
        }
    }

No waveform appears, suggesting to me that the media isn't being properly loaded. Checking the console, I see two errors - the latter of which is related to the timestretcher function, which is looking for a wavesurfer instance such as we discussed in the "variable audio_rate for audio annotation support" thread.


post-sad-multiclass-verbose-log.html (43.4 KB)

It appears from where the code is breaking that perhaps the window.wavesurfer object isn't getting created:


...and indeed, that object is not available in the console.

But if I forgo the custom JS, the upstream problem remains, and it appears to be related to the fetch_media(stream, ["audio"]) call not working properly:


    stream = get_stream()  # your custom stream, see above
    stream = fetch_media(stream, ["audio"])  # replace all "audio" keys with base64

That code was your suggestion, @ines, but I wasn't able to make sense of it - when I troubleshot by iterating through the stream and printing to the console, I didn't see an "audio" key in the dict. I do see a "path" key, and it indeed contains the path to the audio file referenced by the annotation...but if I change the code to instead be

    stream = get_stream()  # your custom stream, see above
    stream = fetch_media(stream, ["path"])  # replace all "audio" keys with base64

I still don't have any luck. And I've been unable to inspect the source code for the Audio(source) loader or the fetch_media() function to understand how else the input they receive could be structured.

Ultimately, all the fetch_media helper does is load the file path, convert it to base64 and replace the given key with the base64-encoded data URI. There's very little magic otherwise. So you can also do this yourself – just convert the audio data to base64 and add it to your stream (and probably remove it in the before_db callback if you don't want to bloat your database).

There's also a helper function in Prodigy for the base64 conversion (it's super lightweight, though, and just calls into base64): Components and Functions · Prodigy · An annotation tool for AI, Machine Learning & NLP
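A minimal sketch of doing that conversion by hand with just the standard library, plus a before_db callback to strip the bulky string again before the answers are saved (the attach_audio name is made up for illustration; it assumes each example has a "path" pointing at a real file, e.g. your pre-chunked clips):

import base64
import mimetypes

def attach_audio(stream):
    for eg in stream:
        # Do what fetch_media would do: read the file behind "path" and put
        # a base64 data URI under the "audio" key the audio UI expects.
        mime = mimetypes.guess_type(eg["path"])[0] or "audio/wav"
        with open(eg["path"], "rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        eg["audio"] = f"data:{mime};base64,{encoded}"
        yield eg

def before_db(examples):
    # Mirror the remove_base64 helper above: keep only the path reference
    # so the base64 blob never ends up in the database.
    for eg in examples:
        if eg.get("audio", "").startswith("data:") and "path" in eg:
            eg["audio"] = eg["path"]
    return examples

The callback would then be returned from the recipe as "before_db": before_db alongside "dataset", "stream" and the other components.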