Sure! So at the moment, the Image
loader takes the directory of images, loads and converts every image to base64 and then creates a stream that looks something like this:
{"image": "......"}
{"image": "......"}
But this produces super long strings and large files, because the images are included in the data. Alternatively, you could also upload your images somewhere and make the data look like this:
{"image": "https://example.com/image1.jpg"}
{"image": "https://example.com/image2.jpg"}
Instead of a public URL, this could also be on localhost
. To implement this in your recipe, one idea could be to serve up the directory of images via a simple HTTP server (or something more sophisticated, if you prefer) and then make your recipe load the same directory and create tasks with the respective file names on your local server. If a file image.jpg
exists in your directory, you know it'll also exist at localhost:1234/image.jpg
or whatever. Here's an example:
from pathlib import Path
local_server = "http://localhost:1234/"
def get_stream():
source_path = Path(source) # The directory with your images
for image_path in source_path.iterdir(): # Iterate over the files in the directory
# Make sure we're only using actual images from the directory
if image_path.suffix in ('.jpg', '.jpeg', '.png'):
# Create the local URL, e.g. http://localhost:1234/image.png
image_url = local_server + image_path.name
yield {"image": image_url}
stream = get_stream()
If you mean excluding already annotated examples when you restart the server: This should be handled automatically, because the incoming tasks still receive hashes, and the same image should receive the same hash.
If you mean excluding duplicate incoming images (e.g. the same image with different file names), that's a little trickier. If you use the base64-encoded images, it should work in theory, because the image is converted to a string and if it's the same image, that string should be identical. But otherwise, you might have to actually load and diff the image – I'm sure this can be done in Python, but it might not be worth the hassle.