Labelling a set of images (classification)

I'm working through what I imagine must be a pretty standard use case:

  • load in a directory full of images
  • select whether the image is a thing or is not (binary classification)
  • use those classifications (i.e. the image name tied to whatever the correct classification was) later on in a computer vision DL project.

I can't quite seem to get Prodigy to do what I want, though. I tried using the mark recipe as suggested in your docs, but this doesn't quite do it. I had a few specific questions off the back of this:

Random order

I read this support topic where it is stated that images are loaded in alphabetical order. I have 65K+ images in my directory. I don't want to label them all, at least not initially. I want to label a random sample of a few hundred or a thousand to get a sense of a baseline for this collection of images. To get this, I'll want to be served the images randomly.

Q: Is there a way to have images loaded in randomly? (I reckon one way of doing this would be to rename all the files with random alphanumeric strings, though then I'd lose the original file names. Is there another way?)

Annotating the same images multiple times

I also noted that when I stopped annotating (CMD-S to save the annotations to the database, then quitting the process in the terminal with Ctrl-C) and restarted, it prompted me to annotate from the very beginning again, including all the images that I'd already labelled.

I saw this support/forum thread. I added the setting "feed_overlap": false into my prodigy.json but it did nothing. I'm also not quite sure what that setting does exactly (and whether I should remove it).

Q: Is there a way to have Prodigy not ask me to relabel images that I've already labelled?

Not saving the original images with mark

And I noticed that (as per the documentation) Prodigy is saving the actual image files themselves in the database. I wasn't quite sure why this was happening. I mainly just want to save the annotation itself, tied to the original filename.

Q: Is there a way to set --remove-base64 when using the mark recipe?

Hi! It sounds like you're definitely on the right track :slightly_smiling_face:

Instead of using the mark recipe, which really just streams in what you give it, you might actually find it easier to just implement a custom recipe for this, since it'll make it more obvious what's going on and lets you add your own custom logic (e.g. for shuffling, removing base64 and maybe other stuff).

This example recipe actually goes in a very similar direction – only that in your case, you'd add a single "label" to the examples instead of "options", and use the classification interface instead of the choice interface.

Sure, that's definitely reasonable. When you load your images from a directory using the Images loader, what you get back is a regular Python generator that yields dictionaries:

from prodigy.components.loaders import Images

stream = Images(source)  # yields dicts with "image" (base64 data URI) and "path"

The most straightforward solution would be to call list() on the stream and random.shuffle() the result – however, this consumes the whole generator upfront, which with 65K+ images means loading them all into memory. Another option would be to go through your stream and use some heuristic to (randomly) decide whether or not to send out a given example for annotation. For instance:

import random

def get_random_stream():
    stream = Images(source)
    for eg in stream:
        if random.random() > 0.7:  # keeps ~30% of examples – tune to taste
            yield eg
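If you'd rather end up with a fixed-size random sample (say, exactly 500 images) than a probabilistic fraction, reservoir sampling also works on a generator without holding everything in memory at once. A sketch – `reservoir_sample` is a made-up helper here, not part of Prodigy, and note it still iterates the whole stream once, so it's slower to start but gives an exact sample size:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, eg in enumerate(stream):
        if i < k:
            sample.append(eg)
        else:
            # Each later item replaces an existing one with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = eg
    rng.shuffle(sample)  # the reservoir itself ends up in a biased order
    return sample
```

Since the Images loader base64-encodes each file as it goes, it may be faster to run this over the file paths first and only load the sampled files.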

Feed overlap isn't what you want here, because that just controls whether multiple annotators in different sessions are asked about the same example or not.

By default, Prodigy will generate two hashes for each example: one representing the input (e.g. the image) and one representing the question about the image (e.g. image + label). If an example with the same task hash is already present in the current dataset, you shouldn't be asked about it again. So you'd see different questions about the same image, but not the same questions about the same image. Alternatively, you can also set "exclude_by": "input" in the "config" returned by your recipe to exclude based on the input hash. In that case, you would only see a given image once.
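Concretely, that's just one extra key in the dict your recipe returns – a sketch, with the stream setup elided and `recipe_components` as a purely illustrative function name:

```python
def recipe_components(dataset, stream):
    # Sketch of the dict a custom recipe returns; the rest of the recipe is elided
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "classification",
        "config": {
            # Dedupe on the input hash (the image itself) rather than the
            # task hash (image + question), so each image is shown only once
            "exclude_by": "input",
        },
    }
```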

If your images don't change between runs and you're saving your annotations to the same dataset, you should only be asked about images you haven't annotated yet.

You can do this in your custom recipe by adding a before_db callback, which can modify examples in place just before they're added to the database – it's the same mechanism the built-in image recipes use to implement --remove-base64.
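A sketch of such a callback, assuming (as with the Images loader) that each example stores the base64 data URI under "image" and the original file path under "path":

```python
def before_db(examples):
    # Swap the inlined base64 data URI back out for the original file path,
    # so the raw image bytes aren't duplicated into the database
    for eg in examples:
        if eg["image"].startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples
```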

This will replace the base64 string with the path, so you just need to make sure the files don't change. Definitely be careful here, though, because you don't want to accidentally destroy any data.

If you don't want to convert the images to base64 in the first place, you can also use the ImageServer loader instead, which serves the images as URLs via a local web server. Another alternative is to put your images in an S3 bucket (or similar) and use those URLs instead.


Love this reply. Thank you for the detail. Was enough for me to get what I need working.

I thought writing up a custom recipe was going to be super fiddly / hard, but with the example recipe to build on, it wasn't actually too bad :slight_smile:

I'll probably write up the things I learnt as a blog post, if only for future me to remember what I did.

The only thing I haven't been able to get working is the exclude_by parameter in the config. I set it to input instead of task. You can see in the following code that I've temporarily disabled the get_random_stream functionality in order to debug. I keep being served the same images that I've already annotated – and this despite the fact that, when I export the annotations to check, the hashes are identical for both input and task.

Any idea what I'm doing wrong there?

import prodigy
import random
from prodigy.components.loaders import Images

LABEL = "some_label"

def before_db(examples):
    for eg in examples:
        if eg["image"].startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples

@prodigy.recipe(
    "classify-images",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Directory of images to load", "positional", None, str),
)
def classify_images(dataset, source):
    # def get_random_stream():
    #     stream = Images(source)
    #     for eg in stream:
    #         if random.random() < 0.05:  # or whatever
    #             yield eg

    def get_stream():
        # stream = get_random_stream()
        stream = Images(source)
        for eg in stream:
            eg["label"] = LABEL
            yield eg

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "classification",
        "before_db": before_db,
        "config": {
            "choice_style": "single",
            "exclude_by": "input"

Glad to hear the custom recipe worked :tada:

This is definitely strange, especially if you've confirmed that the hashes are identical :thinking: Two things to check:

  • Which version of Prodigy are you using and if you're not on the latest, can you try upgrading?
  • Do you still have the feed_overlap setting in your config by any chance (since you mentioned playing with that before)? And if so, does removing it help? It shouldn't actually change anything, especially not in the latest version, but maybe it makes Prodigy treat every new annotation session you start as a new session that should be annotated with overlap.

I'm using Prodigy 1.11.2, which I believe is the latest version. In case it's also useful: Platform macOS-10.15.7-x86_64-i386-64bit.

I did still have the feed_overlap setting in my general config. So I removed that, and now it works as expected!

Problem solved!

Thanks for all your help with this. Much appreciated.

To bring this full circle, @ines, I wrote up a blog post showing how I used Prodigy in my workflow.

Thanks again for your help!


@strickvl Ohh cool, this looks great – thanks so much for sharing :star_struck: Will check it out!


I wanted to write up some brief notes on how I got Prodigy to start serving me images in a random order.

Some context: I have a folder with 100,000+ images inside it, and I would like to annotate some of these images, but in a random order.

So far I've just been following the way that Prodigy does things, i.e. it serves the images one by one in some sort of fixed order (alphabetical, I think).

I tried the suggested behaviour (as above), but I found that with so many source images, I couldn't really set the random.random() threshold to keep less than about 3% of images (with a keep-probability that small, the loader took too long to find each image to serve). All of this meant that my annotation process was frequently paused while a batch of images was selected, and I still ended up with huge class imbalances in my annotations, because I always ended up annotating images from the start of the list – and those were often similar, since their filenames began with the same prefix string.

What I've done now is rename all my files, prepending a random 5-character alphanumeric string. The annotation process works as you'd expect, and is MUCH faster now that there's no random-number thresholding in the loop. I then have a script that converts the JSONL data Prodigy exports into a JSON file suitable for COCO annotations. During that conversion, I strip the 5-character prefix when writing the annotations, rename the relevant files back to their original names, and copy them to a new folder outside the main data source.
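For anyone wanting to do the same, the renaming step can be sketched roughly like this – `randomize_names` is a made-up helper, and it returns a mapping so the original names can be recovered later:

```python
import random
import string
from pathlib import Path

def randomize_names(folder, seed=None):
    """Prepend a random 5-char prefix to every file so that alphabetical
    order is effectively random; returns {new_name: original_name}."""
    rng = random.Random(seed)
    mapping = {}
    # Snapshot the listing first, since we rename files as we go
    for path in list(Path(folder).iterdir()):
        if not path.is_file():
            continue
        prefix = "".join(rng.choices(string.ascii_lowercase + string.digits, k=5))
        new_name = f"{prefix}_{path.name}"
        path.rename(path.with_name(new_name))
        mapping[new_name] = path.name
    return mapping
```

Keeping the returned mapping (e.g. dumped to JSON) makes the later conversion step a simple dictionary lookup instead of string surgery on the prefix.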

It's super hacky, and I wish Prodigy handled the serving of files in a random order for me for the annotation, but it'll work for now.
