Stream videos from AWS S3 using boto3 with the audio waveform displaying correctly

Hi, I'm trying to build a custom loader to stream in video data from S3 for a diarization task (based on the audio.manual recipe). I've based my approach on this answer.

I've adjusted it slightly by loading the videos from a signed URL, which should behave exactly like a public URL. However, while the videos load, the audio waveform is blank (see below). Since this is a diarization task, the waveform is critical.


Fig. 1: Waveform is missing.

This is my current custom stream generator:

import boto3
from config import Config
import re
from botocore import client
from datetime import datetime

class S3Service(object):
    def __init__(
            self,
    ):
        self.bucket = Config.bucket_name
        self.s3 = self.get_s3()

    @staticmethod
    def get_s3():
        s3 = boto3.client(
            's3',
            aws_access_key_id=Config.aws_access_key_id,
            aws_secret_access_key=Config.aws_secret_access_key,
            config=client.Config(signature_version='s3v4')
        )
        return s3

    @staticmethod
    def get_s3_direct_file_regex(
            prefix
    ):
        if not prefix.endswith('/'):
            prefix += '/'
        escaped_subdir = re.escape(prefix)
        pattern = rf'^{escaped_subdir}[^/]+$'
        return re.compile(pattern)

    def generate_signed_url(
            self,
            object_key,
    ) -> str:
        signed_url = self.s3.generate_presigned_url(
            ClientMethod='get_object',
            Params={
                'Bucket': self.bucket,
                'Key': object_key,
            },
            ExpiresIn=Config.expires_in
        )
        return signed_url

    def stream_from_s3(
            self,
            file_type,
            prefix=None,
    ):
        paginator = self.s3.get_paginator('list_objects')
        paginate_params = {
            'Bucket': self.bucket
        }

        if prefix is not None:
            paginate_params['Prefix'] = prefix

        page_iterator = paginator.paginate(**paginate_params)
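        # Only build the direct-child filter when a prefix is given;
        # calling prefix.endswith() on None would raise an AttributeError.
        pattern = self.get_s3_direct_file_regex(prefix) if prefix is not None else None

        for page in page_iterator:
            # A page with no matching objects has no 'Contents' key.
            for obj in page.get('Contents', []):
                if pattern is None or pattern.match(obj['Key']):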
                    object_key = obj['Key']
                    signed_url = self.generate_signed_url(
                        object_key,
                    )
                    annotation_element = {
                        file_type: signed_url,
                        'meta': {
                            "s3_object_key": f"s3://{self.bucket}/{object_key}",
                            "time_stamp": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                        }
                    }
                    yield annotation_element
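
As a quick illustration of what the key filter in the loader above keeps, here is a minimal standalone sketch. The helper is a copy of get_s3_direct_file_regex so it runs without AWS access; the prefix and the object keys are made up for the example.

```python
import re


def direct_file_regex(prefix):
    # Same logic as S3Service.get_s3_direct_file_regex, copied here so the
    # example runs on its own.
    if not prefix.endswith('/'):
        prefix += '/'
    return re.compile(rf'^{re.escape(prefix)}[^/]+$')


# Hypothetical prefix and keys, for illustration only.
pattern = direct_file_regex('videos')
keys = ['videos/', 'videos/a.mp4', 'videos/raw/b.mp4', 'videos2/c.mp4']

# Only objects sitting directly under the prefix match; the folder marker,
# nested objects, and look-alike prefixes are skipped.
matched = [key for key in keys if pattern.match(key)]
print(matched)
```

So a file stored one level deeper (for example under videos/raw/) would never be yielded by stream_from_s3 with this filter.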

This is called directly in a custom recipe as follows:

stream = s3_service.stream_from_s3(file_type=file_type, prefix=prefix)

Any advice on getting the waveform to display from a URL? I've also experimented with reading the video and converting it to base64 with both native Python and Prodigy converters, but those approaches have just hung indefinitely. Any help appreciated!

Hi @cbjrobertson ,

Nothing immediately stands out as wrong in your loader.
Could you try one of the signed URLs explicitly and make sure they're accessible via the browser?
Another source of the problem could be the file format. What is the video format that you're loading?
I recommend downloading one of these videos and seeing whether you can load it from a local path without errors (you can specify the path to a local directory that contains the video as the source argument of audio.manual).
If the videos are long, you could instead try the video server loader locally.
Let us know how it goes!
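
The signed-URL check suggested above can also be scripted rather than done by hand in the browser. The sketch below is only an illustration: the function names are made up, and it assumes the URL was presigned for get_object as in the loader. It sends a ranged GET rather than a HEAD request, because a URL presigned for get_object is only valid for GET.

```python
import urllib.request


def build_range_request(url, n_bytes=1024):
    # Ask only for the first n_bytes so a long video is not downloaded in full.
    return urllib.request.Request(
        url, headers={"Range": f"bytes=0-{n_bytes - 1}"}
    )


def check_signed_url(url, n_bytes=1024, timeout=10):
    # Returns the HTTP status and content type. urlopen raises
    # urllib.error.HTTPError if S3 rejects the request, for example with
    # 403 when the signature has expired.
    request = build_range_request(url, n_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get("Content-Type")
```

Calling check_signed_url(signed_url) on one of the generated URLs should return a 200 or 206 status if the object is reachable.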

Hi there --

Thanks for the reply. Yeah, I checked all that stuff. The videos load, are playable online, and work if I run them locally. To be clear, the setup above loads a playable video. The issue is that there's no waveform, which can make precise annotation difficult.

I've solved it by adding to the recipe:

from prodigy.components.preprocess import fetch_media as fetch_media_preprocessor
...
if fetch_media:
    stream = fetch_media_preprocessor(stream, [file_type], skip=True)

And by setting "batch_size" to 1 in prodigy.json.

However, while this works locally, in production the app is timing out (I'm hosting it on AWS App Runner, which has a hardcoded 120s timeout). Downloading and processing even one long video takes too long.
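
To give a sense of why embedding long videos is heavy, here is a rough sketch of what turning a media file into a base64 data URI involves. It is an illustration of the general technique only, not Prodigy's actual fetch_media implementation, and the function names are made up.

```python
import base64


def to_data_uri(raw: bytes, mime_type: str) -> str:
    # The whole file has to be in memory before it can be encoded.
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def base64_size(n_bytes: int) -> int:
    # Base64 emits 4 output characters for every 3 input bytes (with padding),
    # so the payload grows by roughly a third.
    return 4 * ((n_bytes + 2) // 3)
```

Under these assumptions, a 300 MB video becomes a payload of about 400 MB that has to be downloaded, encoded, and sent to the browser before the task can be shown.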

In the end, I've had to strip the video and use just mp3s as they're light enough to load.
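
For anyone wanting to do the same, one possible way to strip the video track is sketched below. The post does not say which tool was used, so ffmpeg here is an assumption, as are the function names and the quality setting. It requires ffmpeg with the libmp3lame encoder to be installed.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, audio_path):
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # drop the video stream
        "-codec:a", "libmp3lame",  # encode the audio as MP3
        "-q:a", "4",               # variable bitrate quality (assumed setting)
        str(audio_path),
    ]


def extract_audio(video_path):
    # Writes an .mp3 next to the input file and returns its path.
    audio_path = Path(video_path).with_suffix(".mp3")
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)
    return audio_path
```

The resulting MP3s can then be uploaded to S3 and streamed with the same loader by passing the audio key as file_type.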

If there's a possibility of rolling out a feature where the waveform could be displayed without downloading and converting the files to bytes, that would be amazing!

Hi @cbjrobertson ,

Many thanks for sharing your solution and the feedback. We'll definitely discuss your suggestion internally and let you know about any updates.