Stream videos from AWS S3 using boto3 with the audio waveform displaying correctly

cbjrobertson · August 7, 2024, 1:18pm

Hi, I'm trying to build a custom loader to stream video data from S3 for a diarization task (based on the audio.manual recipe). I've based my approach on this answer.

I've adjusted it slightly by loading the videos from a signed_url, which should behave exactly like a public URL. However, while the videos load, the audio waveform is blank (see below). Since this is a diarization task, the waveform is critical.

Fig. 1: Waveform is missing.

This is my current custom stream generator:

import boto3
from config import Config
import re
from botocore import client
from datetime import datetime

class S3Service(object):
    def __init__(
            self,
    ):
        self.bucket = Config.bucket_name
        self.s3 = self.get_s3()

    @staticmethod
    def get_s3():
        s3 = boto3.client(
            's3',
            aws_access_key_id=Config.aws_access_key_id,
            aws_secret_access_key=Config.aws_secret_access_key,
            config=client.Config(signature_version='s3v4')
        )
        return s3

    @staticmethod
    def get_s3_direct_file_regex(
            prefix
    ):
        if not prefix.endswith('/'):
            prefix += '/'
        escaped_subdir = re.escape(prefix)
        pattern = rf'^{escaped_subdir}[^/]+$'
        return re.compile(pattern)

    def generate_signed_url(
            self,
            object_key,
    ) -> str:
        signed_url = self.s3.generate_presigned_url(
            ClientMethod='get_object',
            Params={
                'Bucket': self.bucket,
                'Key': object_key,
            },
            ExpiresIn=Config.expires_in
        )
        return signed_url

    def stream_from_s3(
            self,
            file_type,
            prefix=None,
    ):
        paginator = self.s3.get_paginator('list_objects')
        paginate_params = {
            'Bucket': self.bucket
        }

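        # Without a prefix there is nothing to filter on, so accept every key.
        pattern = None
        if prefix is not None:
            paginate_params['Prefix'] = prefix
            pattern = self.get_s3_direct_file_regex(prefix)

        page_iterator = paginator.paginate(**paginate_params)

        for page in page_iterator:
            # 'Contents' is missing from the response when nothing matches.
            for obj in page.get('Contents', []):
                object_key = obj['Key']
                if pattern is None or pattern.match(object_key):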
                    signed_url = self.generate_signed_url(
                        object_key,
                    )
                    annotation_element = {
                        file_type: signed_url,
                        'meta': {
                            "s3_object_key": f"s3://{self.bucket}/{object_key}",
                            "time_stamp": datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                        }
                    }
                    yield annotation_element

This is called directly in a custom recipe as follows:

stream = s3_service.stream_from_s3(file_type=file_type, prefix=prefix)
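The direct-child filter used by the loader can be checked on its own, without touching S3. The sketch below copies the logic of get_s3_direct_file_regex into a standalone function; the prefix and keys are made-up examples. It matters because the S3 Prefix parameter is a plain string prefix, so a listing for `videos/session-1` would also return keys under `videos/session-10/`.

```python
import re


def direct_file_regex(prefix: str) -> re.Pattern:
    """Same logic as get_s3_direct_file_regex in the loader above."""
    if not prefix.endswith('/'):
        prefix += '/'
    # Match keys that sit directly under the prefix, with no further '/'.
    return re.compile(rf'^{re.escape(prefix)}[^/]+$')


pattern = direct_file_regex('videos/session-1')  # hypothetical prefix

keys = [
    'videos/session-1/a.mp4',      # direct child: kept
    'videos/session-1/',           # "folder" placeholder key: skipped
    'videos/session-1/raw/b.mp4',  # nested one level deeper: skipped
    'videos/session-10/c.mp4',     # sibling that shares the string prefix: skipped
]
matched = [key for key in keys if pattern.match(key)]
print(matched)  # ['videos/session-1/a.mp4']
```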

Any advice on getting the waveform to display from a URL? I've also experimented with reading the video and converting it to base64 with both native Python and Prodigy converters, but those approaches just hung indefinitely. Any help appreciated!

magdaaniol · August 12, 2024, 9:34am

Hi @cbjrobertson ,

Nothing immediately stands out as wrong in your loader.
Could you try one of the signed URLs explicitly and make sure they're accessible via the browser?
Another source of the problem could be the file format. What is the video format that you're loading?
I recommend downloading one of these videos and checking whether you can load it from a local path without errors (you can specify the path to a local directory that contains the video as the source argument of audio.manual).
If the videos are long, you could try the video-server loader locally instead.
Let us know how it goes!
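
One way to run the signed-URL check outside the browser is sketched below, using only the standard library. It assumes SigV4 presigned URLs (as produced by the loader above, which sets signature_version='s3v4'); the function names and the example URL are made up for illustration. Because a URL presigned for get_object is only valid for GET, the second helper sends a ranged GET rather than a HEAD request.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit
from urllib.request import Request, urlopen


def signed_url_expiry(url: str) -> datetime:
    """Return the UTC time at which a SigV4 presigned URL stops working."""
    query = parse_qs(urlsplit(url).query)
    signed_at = datetime.strptime(query['X-Amz-Date'][0], '%Y%m%dT%H%M%SZ')
    lifetime = timedelta(seconds=int(query['X-Amz-Expires'][0]))
    return signed_at.replace(tzinfo=timezone.utc) + lifetime


def fetch_first_bytes(url: str, n: int = 1024):
    """GET only the first n bytes; return (HTTP status, Content-Type).

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses, e.g. a
    403 from an expired or malformed signature.
    """
    request = Request(url, headers={'Range': f'bytes=0-{n - 1}'})
    with urlopen(request, timeout=10) as response:
        response.read()
        return response.status, response.headers.get('Content-Type', '')


# Placeholder URL: only the query parameters matter for the expiry check.
example = (
    'https://example-bucket.s3.amazonaws.com/videos/a.mp4'
    '?X-Amz-Algorithm=AWS4-HMAC-SHA256'
    '&X-Amz-Date=20240807T131800Z&X-Amz-Expires=3600'
    '&X-Amz-Signature=placeholder'
)
print(signed_url_expiry(example).isoformat())  # 2024-08-07T14:18:00+00:00
```

Calling fetch_first_bytes on a real URL from the stream shows whether the link is reachable and what Content-Type S3 reports for the object.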

cbjrobertson · August 29, 2024, 10:10am

Hi there --

Thanks for the reply. Yeah, I checked all that stuff. The videos load, are playable online, and work if I run them locally. To be clear, the setup above loads a playable video. The issue is that there's no waveform, which can make precise annotation difficult.

I've solved it by adding to the recipe:

from prodigy.components.preprocess import fetch_media as fetch_media_preprocessor
...
if fetch_media:
    stream = fetch_media_preprocessor(stream, [file_type], skip=True)

And by setting "batch_size" to 1 in prodigy.json.
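
For anyone wondering why this gets heavy: as far as I understand the docs, fetch_media replaces each media URL or path in the task with a base64 data URI, so the whole file travels inside the task JSON. The sketch below is not Prodigy's implementation, just a standard-library illustration of that encoding and its size overhead; to_data_uri and encoded_size are names I made up.

```python
import base64


def to_data_uri(data: bytes, mimetype: str) -> str:
    """Embed raw bytes in a data URI, the form a browser can load inline."""
    encoded = base64.b64encode(data).decode('ascii')
    return f'data:{mimetype};base64,{encoded}'


def encoded_size(n_bytes: int) -> int:
    """Length of the base64 text for n_bytes of input (4 chars per 3 bytes, padded)."""
    return 4 * ((n_bytes + 2) // 3)


uri = to_data_uri(b'\x00\x01\x02', 'video/mp4')
print(uri)  # data:video/mp4;base64,AAEC

# A 300 MiB video becomes 400 MiB of text before it reaches the browser.
print(encoded_size(300 * 1024 * 1024) / (1024 * 1024))  # 400.0
```

Every task in a batch carries its own payload, which is presumably why a batch size of 1 helps: only one encoded file has to be prepared and sent at a time.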

However, while this works locally, in production the app times out (I'm hosting it on AWS App Runner, which has a hardcoded 120s timeout). Downloading and processing even one long video takes too long.

In the end, I've had to strip the video and use just mp3s as they're light enough to load.
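
A minimal sketch of the audio-only conversion, assuming ffmpeg (built with libmp3lame) is installed and on the PATH. The function names and the quality setting are illustrative choices, not a fixed recipe.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst):
    """Build an ffmpeg call that drops the video stream and encodes MP3 audio."""
    return [
        'ffmpeg',
        '-y',                      # overwrite dst if it already exists
        '-i', str(src),            # input video
        '-vn',                     # no video in the output
        '-codec:a', 'libmp3lame',  # MP3 encoder
        '-q:a', '4',               # VBR quality (lower number = higher quality)
        str(dst),
    ]


def extract_audio(src: Path, out_dir: Path) -> Path:
    """Write <out_dir>/<video name>.mp3 and return its path."""
    dst = out_dir / (src.stem + '.mp3')
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return dst


print(' '.join(build_ffmpeg_command('talk.mp4', 'talk.mp3')))
```

The resulting files can be uploaded under their own prefix and streamed with the same loader by passing file_type='audio'.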

If there's a possibility of rolling out a feature where the waveform could be displayed without downloading and converting the files to bytes, that would be amazing!

magdaaniol · August 31, 2024, 4:45pm

Hi @cbjrobertson ,

Many thanks for sharing your solution and the feedback. We'll definitely discuss your suggestion internally and let you know about any updates.
