Labelling a set of images (classification)

Hi! It sounds like you're definitely on the right track :slightly_smiling_face:

Instead of using the mark recipe, which just streams in whatever you give it, you might find it easier to implement a custom recipe for this. That makes it more obvious what's going on and lets you add your own logic (e.g. for shuffling, removing the base64 data and so on).

This example recipe goes in a very similar direction: see the "Computer Vision" page in the Prodigy docs. The only difference is that in your case, you'd add a single "label" to each example instead of "options", and use the classification interface instead of choice.
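To make the "label" plus classification idea a bit more concrete, here's a rough sketch of the pieces such a recipe would return. The helper names add_label and build_components are made up for illustration. In a real recipe you'd put this logic inside a function decorated with @prodigy.recipe and create the stream with the Images loader.

```python
def add_label(stream, label):
    """Attach the same label to every incoming example."""
    for eg in stream:
        # Copy the task so the original dict isn't mutated.
        yield {**eg, "label": label}


def build_components(dataset, stream, label):
    """Build the dict of components that a recipe function returns."""
    return {
        "dataset": dataset,                  # dataset to save annotations to
        "stream": add_label(stream, label),  # examples with a single "label"
        "view_id": "classification",         # instead of "choice"
    }
```

Each example then shows up as one accept/reject question about the image and the given label.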

Sure, that's definitely reasonable. When you load your images from a directory using the Images loader, what you get back is a regular Python generator that yields dictionaries:

stream = Images(source)

The most straightforward solution would be to call list() on the stream and then random.shuffle() the result. However, this consumes the whole generator upfront, so every image is loaded before the first one is shown. Another option is to go through your stream and use some heuristic to (randomly) decide whether or not to send out a given example for annotation. For instance:

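import random
from prodigy.components.loaders import Images

def get_random_stream(source):
    stream = Images(source)
    for eg in stream:
        # keeps roughly 30% of the examples; adjust the threshold as needed
        if random.random() > 0.7:
            yield eg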
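For comparison, the list() plus random.shuffle() option mentioned above could look roughly like this. The function name shuffled_stream and the optional seed argument are just assumptions for the sketch. It works on any iterable of examples, so you could pass in the result of the Images loader.

```python
import random


def shuffled_stream(stream, seed=None):
    """Load the whole stream into memory and return it in random order."""
    examples = list(stream)  # consumes the generator upfront
    # A dedicated Random instance makes the order reproducible if a seed is set.
    random.Random(seed).shuffle(examples)
    return examples
```

This is fine for a few hundred or thousand images, but for very large directories the sampling approach is likely the better fit.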

Feed overlap isn't what you want here, because that just controls whether multiple annotators in different sessions are asked about the same example or not.

By default, Prodigy will generate two hashes for each example: one representing the input (e.g. the image) and one representing the question about the image (e.g. image + label). If an example with the same task hash is already present in the current dataset, you shouldn't be asked about it again. So you'd see different questions about the same image, but not the same questions about the same image. Alternatively, you can also set "exclude_by": "input" in the "config" returned by your recipe to exclude based on the input hash. In that case, you would only see a given image once.
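To illustrate the difference between the two settings, here's a simplified sketch of the filtering idea. This is not Prodigy's actual implementation, and filter_seen is a made-up name. It only assumes that each example carries the "_input_hash" and "_task_hash" keys that Prodigy adds.

```python
def filter_seen(stream, seen_hashes, exclude_by="task"):
    """Skip examples whose hash is already in the set of seen hashes."""
    # "input" compares only the image, "task" compares image + label.
    key = "_input_hash" if exclude_by == "input" else "_task_hash"
    for eg in stream:
        if eg[key] not in seen_hashes:
            yield eg
```

With exclude_by set to "task", a second question about an already annotated image still gets through. With "input", any example using that image is skipped.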

If your images don't change between runs and you're saving your annotations to the same dataset, you should only be asked about images you haven't annotated yet.

You can do this in your custom recipe by adding a before_db callback that modifies examples in place before they're added to the database. The "Custom Recipes" page in the Prodigy docs has an example of the same code the built-in image recipes use to implement --remove-base64.

This will replace the base64 string with the path, so you just need to make sure the files don't change. Definitely be careful here, though, because you don't want to accidentally destroy any data.
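As a rough sketch of what such a callback can look like (assuming each example stores the base64 data under "image" and the original file path under "path"):

```python
def before_db(examples):
    """Swap base64 image data for the file path before saving."""
    for eg in examples:
        image = eg.get("image", "")
        # Only touch base64 data URIs, and only if a path is available.
        if image.startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples
```

You'd then return it from your recipe as "before_db": before_db, alongside the other components. Examples without a "path" are left untouched, so no image data gets thrown away by accident.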

If you don't want to convert the images to base64, you can also use the ImageServer loader instead, which provides the image URLs via a local web server (see the "Loaders and Input Data" page in the Prodigy docs). Another alternative is to just put your images in an S3 bucket or similar and use the URLs instead.
