Using Boto3 to stream photos from an s3 bucket


We're a small team trying to build a stream source from S3 buckets. We have a bucket full of images that need to be classified, but looking at the examples and what we've tried, it doesn't seem like there is a straightforward way to accomplish this, especially if you have a very large dataset. We have over 1M images.

It seems like a common use case, but I cannot find any documentation pertinent to this. Any help understanding the correct way to stream images from an s3 bucket for model-in-the-loop annotation is appreciated.

Specifically, the CLI requires 'source' as an input: "prodigy classify-images [-h] dataset source", but s3 would be a non-local file path in this case.

Thank you

Hi! This should be pretty straightforward to set up :slightly_smiling_face: The default loaders load from local paths, but you can also write your own that load from anywhere else. Streams in Prodigy are Python generators that yield dictionaries – so all you need to do is connect to your bucket (however you normally do that using Boto3), get the images and output dictionaries in Prodigy's format, e.g. with the image as the key "image".

A good place to start is the documentation on custom loaders:

The size of the dataset shouldn't matter, because if your data is in an s3 bucket, you'd typically only be streaming in the URLs to the images. Prodigy will process the stream in batches, so it's only ever loading one batch at a time and you can get started immediately and don't have to wait for your whole dataset to be loaded.
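
As a rough sketch of what such a loader could look like: the generator below lists a bucket one page at a time with Boto3's `list_objects_v2` paginator and yields one dictionary per image. The function name, the `suffixes` filter and the `meta` contents are made up for illustration, and the URL format assumes the objects are readable from the browser at the bucket's virtual-hosted-style address. If they aren't, you'd need to produce the URLs some other way.

```python
from urllib.parse import quote


def stream_s3_images(client, bucket, prefix="", suffixes=(".jpg", ".jpeg", ".png")):
    """Lazily yield one {"image": url} task per image object in the bucket."""
    # The paginator follows S3's continuation tokens for us, so only one
    # page of keys (up to 1,000 per request) is held in memory at a time.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # An empty page has no "Contents" entry at all.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if not key.lower().endswith(suffixes):
                continue  # skip anything that doesn't look like an image
            # Assumption: the object is reachable at this URL from the browser.
            url = f"https://{bucket}.s3.amazonaws.com/{quote(key)}"
            yield {"image": url, "meta": {"key": key}}


# Hypothetical usage (needs boto3 and AWS credentials):
#   import boto3
#   stream = stream_s3_images(boto3.client("s3"), "my-bucket", prefix="photos/")
```

Because this is a generator, nothing is listed until the stream is consumed, and a bucket with 1M images is never loaded in one go.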

The source argument of the built-in recipes is passed to the respective loader and is typically a local file path. But it doesn't have to be. You can also make it read from standard input and pipe data forward from a different process and loader (see here for an example).

If you write your loader as a Python script, you could even make it really fancy and have it take command line arguments to specify the buckets to load from, which files to select etc. That's all up to you.
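
For illustration, a loader script along those lines might look roughly like this. It takes the bucket and an optional prefix and limit as command-line arguments and prints one JSON object per line to standard output. All argument and function names here are invented for the example, and the URL format again assumes the objects can be read directly at the bucket's address.

```python
import argparse
import json
import sys
from itertools import islice
from urllib.parse import quote


def build_parser():
    parser = argparse.ArgumentParser(description="Print image tasks as JSON lines.")
    parser.add_argument("bucket", help="name of the S3 bucket")
    parser.add_argument("--prefix", default="", help="only list keys under this prefix")
    parser.add_argument("--limit", type=int, default=None, help="stop after this many tasks")
    return parser


def write_jsonl(tasks, out=sys.stdout, limit=None):
    """Write one JSON object per line and return the number of lines written."""
    count = 0
    # islice with limit=None simply passes the whole stream through.
    for task in islice(tasks, limit):
        out.write(json.dumps(task) + "\n")
        count += 1
    return count


def list_image_tasks(bucket, prefix):
    """Yield {"image": url} dicts for a bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            # Assumption: the object is readable at this URL from the browser.
            yield {"image": f"https://{bucket}.s3.amazonaws.com/{quote(obj['Key'])}"}


def main(argv=None):
    args = build_parser().parse_args(argv)
    return write_jsonl(list_image_tasks(args.bucket, args.prefix), limit=args.limit)
```

In a real script you'd call `main()` under the usual `if __name__ == "__main__":` guard, and the printed lines could then be piped forward to a recipe that reads from standard input.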

Hi Ines,

Thank you for this response. It is helpful.

Should my custom stream itself be a generator that yields a batch of keys? Iterating over a large number of objects in S3 requires using their 'continuation token' approach (essentially pagination).

Additionally, with such a large corpus, if Prodigy never saw the entire list of photos during an annotation session, when we start another session, will Prodigy just skip all the images that it has already seen and that have been annotated?

Lastly - is it safe to assume that the solution proposed here for the Images() function would also work if we used ImageServer() as a drop-in replacement?


Yes, if your stream is set up that way, it should work fine. You probably just want to find the most efficient way to load batches of file paths (and optional metadata), like, one page at a time etc.

If you don't want to do all of this during the annotation session, another approach could be to have a separate process that periodically updates an index of all images in your bucket (with URLs, meta and a unique ID, either in a flat JSONL file, a simple SQLite DB etc.) and then have Prodigy load from that. Could be cleaner, because it means you don't have to worry about specifics of iterating over S3 objects at runtime during annotation.
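
A minimal sketch of the index idea, assuming a single SQLite table with one row per image. The table layout and function names are hypothetical; the separate process would call `update_index` with whatever (key, URL) pairs it finds in the bucket, and the annotation side would only ever read from the database.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images (key TEXT PRIMARY KEY, url TEXT NOT NULL)"
    )
    return conn


def update_index(conn, rows):
    """Insert (key, url) pairs; keys already in the index are left untouched."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO images (key, url) VALUES (?, ?)", rows
        )


def stream_from_index(conn):
    """Yield one task dict per indexed image, in a stable key order."""
    for key, url in conn.execute("SELECT key, url FROM images ORDER BY key"):
        yield {"image": url, "meta": {"key": key}}
```

Using the S3 key as the primary key means the indexing job can be re-run as often as you like without creating duplicate rows.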

Yes, that's the default behaviour. Examples are deduplicated using hashes, so if you're streaming in S3 URLs, two examples with the same "image" would be considered the same input.

If it's too expensive to load a lot of objects again just to check if they've been annotated before, you could also keep your own cache of keys you sent out, or use the Prodigy database on startup to get the IDs of all annotated examples, and then skip loading those again. (Is there an efficient S3 way to tell it, get me a page of keys from this bucket, except for those?)
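
The "own cache" variant could be sketched like this, assuming each task carries its S3 key under `meta` and that a plain text file with one key per line is good enough as a cache. Both of those are assumptions made for the example, as are the function names. The `seen` set could just as well be filled from the Prodigy database on startup instead of from a file.

```python
from pathlib import Path


def load_seen(path):
    """Return the set of keys recorded in the cache file (empty if it is missing)."""
    cache = Path(path)
    if not cache.exists():
        return set()
    return set(cache.read_text(encoding="utf8").splitlines())


def skip_seen(tasks, seen, cache_path=None):
    """Yield only tasks whose key is not in `seen`, recording each key sent out."""
    for task in tasks:
        key = task["meta"]["key"]
        if key in seen:
            continue  # already sent out in this or an earlier session
        seen.add(key)
        if cache_path is not None:
            # Append immediately so the cache survives an interrupted session.
            with open(cache_path, "a", encoding="utf8") as f:
                f.write(key + "\n")
        yield task
```

Note that this records keys as they are sent out, not when they are annotated, so an image that was queued but never answered would also be skipped next time.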

In general, yes – although, if I understand your use case correctly, you'd probably skip those loaders entirely and use your custom loader instead, which streams in dictionaries and uses S3 URLs for the images. That's the problem Images and ImageServer are trying to solve: take images from a local directory and make them available in the browser, either by converting the image to base64 or by serving it with a web server. Your images are already served, so if you have a process that gets the desired image URLs from your bucket and sends out {"image": "..."} dictionaries, that's all you need.

Thank you for all of your incredible feedback. You Prodigy folk have been so awesome in the forums - such a good customer service experience.

If we roll an S3 loader library for this, we're happy to contribute it back to the project if you have any interest in pulling it in as an available loader.



Thanks, glad to hear that it makes a difference :smiley: And yes, that'd be great, I'm sure there are many users who would find it useful!

@wmelton How did you implement the S3 loader library? I am trying to do something similar but having some difficulties.