Using Boto3 to stream photos from an s3 bucket

Hello,

We're a small team trying to build a stream source from S3 buckets. We have a bucket full of images that need to be classified, but looking at the examples and what we've tried, it doesn't seem like there is a straightforward way to accomplish this, especially if you have a very large dataset. We have over 1M images.

It seems like a common use case, but I cannot find any documentation pertinent to this. Any help understanding the correct way to stream images from an s3 bucket while doing model-in-the-loop annotation is appreciated.

Specifically, the CLI requires 'source' as an argument: "prodigy classify-images [-h] dataset source", but in this case the source would be an s3 location rather than a local file path.

Thank you

Hi! This should be pretty straightforward to set up :slightly_smiling_face: The default loaders load from local paths, but you can also write your own that load from anywhere else. Streams in Prodigy are Python generators that yield dictionaries – so all you need to do is connect to your bucket (however you normally do that using Boto3), get the images and output dictionaries in Prodigy's format, e.g. with the image as the key "image".
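
As a minimal sketch of what "a generator that yields dictionaries" can look like: the function name `image_stream` and the example URLs below are made up for illustration, and only the "image" key comes from the description above.

```python
from typing import Dict, Iterable, Iterator


def image_stream(urls: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one task dictionary per image URL, lazily."""
    for url in urls:
        # "image" is the key mentioned above for image tasks.
        yield {"image": url}


# Hypothetical URLs, used only to show the shape of the output.
example_urls = [
    "https://example.com/images/cat.jpg",
    "https://example.com/images/dog.jpg",
]
tasks = list(image_stream(example_urls))
```

Because it is a generator, nothing is fetched or built until the consumer asks for the next item.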

A good place to start is the documentation on custom loaders: "Loaders and Input Data" in the Prodigy docs.

The size of the dataset shouldn't matter, because if your data is in an s3 bucket, you'd typically only be streaming in the URLs to the images. Prodigy will process the stream in batches, so it's only ever loading one batch at a time and you can get started immediately and don't have to wait for your whole dataset to be loaded.

The source argument of the built-in recipes is passed to the respective loader and is typically a local file path. But it doesn't have to be. You can also make it read from standard input and pipe data forward from a different process and loader (see here for an example).

If you write your loader as a Python script, you could even make it really fancy and have it take command line arguments to specify the buckets to load from, which files to select etc. That's all up to you.
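
A rough sketch of such a script, assuming it prints one JSON object per line to standard output so it can be piped into a recipe. The argument names (`bucket`, `--prefix`) and the `fake_list_urls` helper are hypothetical stand-ins, not part of Prodigy or Boto3; a real script would replace the helper with an actual bucket listing.

```python
import argparse
import io
import json
import sys
from typing import Callable, Iterable, List, Optional, TextIO


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Print image tasks as JSONL.")
    parser.add_argument("bucket", help="Name of the bucket to read from")
    parser.add_argument("--prefix", default="", help="Only include keys with this prefix")
    return parser.parse_args(argv)


def main(
    argv: Optional[List[str]],
    list_urls: Callable[[str, str], Iterable[str]],
    out: TextIO = sys.stdout,
) -> int:
    """Write one {"image": url} JSON line per URL and return the count."""
    args = parse_args(argv)
    count = 0
    for url in list_urls(args.bucket, args.prefix):
        out.write(json.dumps({"image": url}) + "\n")
        count += 1
    return count


def fake_list_urls(bucket: str, prefix: str) -> List[str]:
    # Stand-in for a real bucket listing (assumption, for illustration only).
    return [f"https://example.com/{bucket}/{prefix}{name}" for name in ("a.jpg", "b.jpg")]


# Demonstration with an in-memory buffer instead of stdout.
buffer = io.StringIO()
written = main(["my-bucket", "--prefix", "photos/"], fake_list_urls, out=buffer)
lines = buffer.getvalue().splitlines()
```

In a real script you would call `main(None, your_listing_function)` so the arguments come from the command line, and then pipe the output into a recipe that reads from standard input.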

Hi Ines,

Thank you for this response. It is helpful.

Should my custom stream itself be a generator that yields a batch of keys? Iterating over a large number of objects in S3 requires their 'continuation key' approach (essentially pagination).

Additionally, with such a large corpus, if Prodigy never saw the entire list of photos during an annotation session, will it just skip all the images that it has already seen and that have been annotated when we start another session?

Lastly - is it safe to assume that the solution proposed here for the Images() function would also work if we used ImageServer() as a drop-in replacement?

Thanks!

Yes, if your stream is set up that way, it should work fine. You probably just want to find the most efficient way to load batches of file paths (and optional metadata), like, one page at a time etc.
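
One way to sketch the page-at-a-time idea, assuming the S3 `list_objects_v2` call and its `ContinuationToken` / `NextContinuationToken` fields. The `iter_keys` name is made up, and `FakeS3Client` is only a stand-in so the sketch runs without AWS credentials; with Boto3 you would pass `boto3.client("s3")` instead.

```python
from typing import Any, Dict, Iterator, List, Optional


def iter_keys(client: Any, bucket: str, prefix: str = "") -> Iterator[str]:
    """Yield object keys one page at a time, following continuation tokens."""
    kwargs: Dict[str, Any] = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
        if not page.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = page["NextContinuationToken"]


class FakeS3Client:
    """Tiny in-memory stand-in that mimics paginated list_objects_v2 responses."""

    def __init__(self, keys: List[str], page_size: int = 2) -> None:
        self._keys = keys
        self._page_size = page_size

    def list_objects_v2(
        self, Bucket: str, Prefix: str = "", ContinuationToken: Optional[str] = None
    ) -> Dict[str, Any]:
        keys = [k for k in self._keys if k.startswith(Prefix)]
        start = int(ContinuationToken or 0)
        end = start + self._page_size
        page: Dict[str, Any] = {
            "Contents": [{"Key": k} for k in keys[start:end]],
            "IsTruncated": end < len(keys),
        }
        if page["IsTruncated"]:
            page["NextContinuationToken"] = str(end)
        return page


fake = FakeS3Client(["a/1.jpg", "a/2.jpg", "a/3.jpg", "b/1.jpg"])
found = list(iter_keys(fake, "my-bucket", prefix="a/"))
```

Since `iter_keys` is a generator, the next page is only requested once the previous one has been consumed.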

If you don't want to do all of this during the annotation session, another approach could be to have a separate process that periodically updates an index of all images in your bucket (with URLs, meta and a unique ID, either in a flat JSONL file, a simple SQLite DB etc.) and then have Prodigy load from that. Could be cleaner, because it means you don't have to worry about specifics of iterating over S3 objects at runtime during annotation.
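
A small sketch of the flat-file variant, assuming a JSONL index where every line has an "id" and a "url" field. Those field names, the file name and the helper functions are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Any, Dict, Iterable, Iterator


def write_index(records: Iterable[Dict[str, Any]], path: str) -> None:
    """Write one JSON record per line (this is what the separate process would do)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def stream_from_index(path: str) -> Iterator[Dict[str, Any]]:
    """Read the index lazily and yield one image task per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            yield {"image": record["url"], "meta": {"id": record["id"]}}


# Hypothetical records, only to demonstrate the round trip.
records = [
    {"id": "img-001", "url": "https://example.com/images/cat.jpg"},
    {"id": "img-002", "url": "https://example.com/images/dog.jpg"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "index.jsonl")
    write_index(records, index_path)
    indexed_tasks = list(stream_from_index(index_path))
```

The annotation process then only ever reads a local file, and the S3 listing logic lives entirely in the indexing job.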

Yes, that's the default behaviour. Examples are deduplicated using hashes, so if you're streaming in S3 URLs, two examples with the same "image" would be considered the same input.

If it's too expensive to load a lot of objects again just to check if they've been annotated before, you could also keep your own cache of keys you sent out, or use the Prodigy database on startup to get the IDs of all annotated examples, and then skip loading those again. (Is there an efficient S3 way to tell it, get me a page of keys from this bucket, except for those?)
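
The "keep your own cache" idea could look roughly like this. The `skip_seen` helper is hypothetical, and where the set of already-handled image URLs comes from (your own cache file, or values read from the Prodigy database on startup) is left open here.

```python
from typing import Dict, Iterable, Iterator, Set


def skip_seen(tasks: Iterable[Dict[str, str]], seen_images: Set[str]) -> Iterator[Dict[str, str]]:
    """Yield only tasks whose "image" value has not been handled before."""
    for task in tasks:
        if task["image"] in seen_images:
            continue
        yield task


# Hypothetical data: one image was annotated in an earlier session.
already_done = {"https://example.com/images/cat.jpg"}
incoming = [
    {"image": "https://example.com/images/cat.jpg"},
    {"image": "https://example.com/images/dog.jpg"},
]
remaining = list(skip_seen(incoming, already_done))
```

The same filter works on keys instead of URLs, which would let you skip objects before building a URL or downloading anything.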

In general, yes – although, if I understand your use case correctly, you'd probably skip those loaders entirely and use a custom loader instead that streams in dictionaries and uses S3 URLs for the images. That's the problem Images and ImageServer are trying to solve: take images from a local directory and make them available in the browser, either by converting the image to base64 or by serving it with a web server. Your images are already served, so if you have a process that gets the desired image URLs from your bucket and sends out {"image": "..."} dictionaries, that's all you need.

Thank you for all of your incredible feedback. You Prodigy folk have been so awesome in the forums - such a good customer service experience.

If we roll an S3 loader library for this, we're happy to contribute it back to the project if you have any interest in pulling it in as an available loader.

Thanks!

Thanks, glad to hear that it makes a difference :smiley: And yes, that'd be great, I'm sure there are many users who would find it useful!

@wmelton How did you implement S3 Loader Library? I am trying to do something similar but having some difficulties.

same here, would love to have some insights

I faced a similar problem and couldn't find a good example for a solution. This is the custom loader that ended up working for me:

import boto3
import prodigy
import json
from prodigy.util import img_to_b64_uri


@prodigy.recipe("stream-from-s3")
def stream_from_s3(bucket, prefix=None):
    # Create the S3 client.
    s3 = boto3.client('s3')

    # Build a paginator for when there are a lot of objects.
    paginator = s3.get_paginator('list_objects')
    paginate_params = {
        'Bucket': bucket
    }

    # Check if only certain images from S3 should be loaded.
    if prefix is not None:
        paginate_params['Prefix'] = prefix

    page_iterator = paginator.paginate(**paginate_params)

    # Iterate through the pages.
    for page in page_iterator:
        # Iterate through items on the page. 'Contents' is missing
        # when nothing matches, so fall back to an empty list.
        for obj in page.get('Contents', []):
            img_key = obj['Key']

            # Read the image bytes.
            img = s3.get_object(Bucket=bucket, Key=img_key).get('Body').read()

            # Print one JSON task per line, with the image as a base64 data URI.
            # Note: the MIME type is hard-coded and assumes JPEG images.
            print(json.dumps({'image': img_to_b64_uri(img, 'image/jpeg')}))

You could then use the custom loader to pipe images to your annotator.

prodigy stream-from-s3 BUCKET PREFIX -F s3_loader.py | prodigy mark DATASET - --label LABEL --view-id classification

I put the code in this GitHub repo. Happy to hear any suggestions for how this could be improved!

@matthewvielkind Oh cool, thanks so much for sharing! :100: Also happy to see that the actual code needed here is super straightforward and it mostly just comes down to calling the right s3 methods. Another option that could potentially be useful here is to just have it output the URL instead of the base64 (assuming the bucket is public).
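
For the URL variant, a minimal sketch could build the address from the bucket name and key. This assumes a publicly readable bucket and the virtual-hosted style address `https://BUCKET.s3.amazonaws.com/KEY`; depending on the bucket's region or name, a different endpoint may be needed. The `public_s3_url` helper is hypothetical.

```python
from urllib.parse import quote


def public_s3_url(bucket: str, key: str) -> str:
    """Build a URL for an object in a public bucket (assumed URL style)."""
    # quote() percent-encodes spaces and other special characters but keeps "/".
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


task = {"image": public_s3_url("my-bucket", "photos/cat 1.jpg")}
```

This avoids downloading and base64-encoding each image in the loader, since the browser fetches the image directly.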

Also, I haven't really used Boto3 much myself but what's your experience with it? Is the API stable? Does it change often? I would consider shipping a loader like this out-of-the-box with Prodigy but integrating third-party SDKs can often be tricky, and they vary greatly in stability etc. If it's something we have to keep updating a lot, it makes more sense to have it as a standalone script (because the one thing worse than not having a built-in integration for something is a built-in integration that's broken :sweat_smile:).

@ines Certainly, happy to help! Boto3 is a massive API that covers so many services that they're frequently making minor updates. Despite the frequent updates, the core functions, like getting an object from S3, are really stable. I've been using boto3 regularly for years and I can't recall an instance where upgrading boto3 caused major breaking changes in my workflows reading/writing from S3.

@matthewvielkind Thanks, that's really good to know. I'll put this on my list of enhancements then, it'd definitely be cool to have an S3 loader out of the box :blush: I'll probably take some inspiration from your code once I start working on this. Thanks again for sharing!