Labelling a set of images using a custom recipe


I am currently developing a custom Prodigy recipe to label my own images within a specific folder. My approach is largely based on the example outlined in the "Assigning multiple labels to images" section of "Computer Vision · Prodigy · An annotation tool for AI, Machine Learning & NLP."

I have made several modifications to this example recipe, such as adding a "before_db" function that removes the base64 representation of the image from the output JSONL and substitutes the corresponding image file path. I have also changed the "classify_images" function so that users can choose the labels they want through the command line. The command I use to begin annotating looks something like:

$ python -m prodigy classify-images image_dataset ./images --label "A,B,C" -F

While the recipe appears to be working, I have encountered an issue during the image labelling process. I have noticed that certain images randomly appear multiple times, meaning that the same image will be annotated more than once. For example, if I only have 100 images in a folder I will end up with something like 130 annotations because some images have randomly appeared more than once.

I am uncertain about the root cause of this issue. It is worth noting that my image folder consists of a combination of PNG and JPEG files, and I'll attach my custom recipe file for your reference.

Any help or suggestions would be much appreciated! :slight_smile:

import prodigy
from prodigy.components.loaders import Images
from prodigy.util import split_string
from typing import List

@prodigy.recipe("classify-images",  label=("Comma-separated label(s)", "option", "l", split_string))
def classify_images(dataset, source, label: List[str]):
    OPTIONS = []
    number = 0
    for category in label:
        OPTIONS.append({"id": number, "text": category})
        number += 1

    # OPTIONS=label
    def get_stream():
        # Load the directory of images and add options to each task
        stream = Images(source)
        for eg in stream:
            eg["options"] = OPTIONS
            # eg = eg["path", "options", "accept", "answer"]
            yield eg

    def before_db(examples):
        for eg in examples:
            # If the image is a base64 string and the path to the original file
            # is present in the task, remove the image data
            if eg["image"].startswith("data:") and "path" in eg:
                eg["image"] = eg["path"]
        return examples

    return {
        "before_db": before_db,
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "choice",
        "config": {
            "choice_style": "multiple",  # or "single"
            # Automatically accept and submit the answer if an option is
            # selected (only available for single-choice tasks)
            "choice_auto_accept": False
        }
    }

Hi Alex!

I don't see anything strange with your recipe just from glancing at it. I'd be willing to toy with it locally on my machine, but I figured I'd ask a few quick questions first.

  1. Are you labelling this on your own or with a team? If you're doing it as a team, what are the feed_overlap settings?
  2. What version of Prodigy are you using? Are you using the new alpha version with the new task routers?

If I had to guess ... I wonder if maybe something is up with the hashes. Your custom recipe is not setting the hashes via the set_hashes function, which means that Prodigy needs to guess how to set them. And your before_db call is overwriting the eg['image'] key, which is one of the keys used for hashing by default.

Could you try again but now set the hashes yourself?

Hi @koaning!

Thank you ever so much for your response!

To answer your questions:

  1. I am annotating this by myself, not in a team.
  2. I’m using a slightly older version of Prodigy (version 1.11.6).

I have just tweaked my “get_stream” function to include a line which sets the hash_id:

eg = prodigy.set_hashes(eg, input_keys=["image"], task_keys=["image"])

Such that it is now:

def get_stream():
    # Load the directory of images and add options to each task
    stream = Images(source)
    for eg in stream:
        eg["options"] = OPTIONS
        # Hash on the image content so identical images share the same hash
        eg = prodigy.set_hashes(eg, input_keys=["image"], task_keys=["image"])
        yield eg

I just tested this tweaked recipe on an image folder with 96 images and it seemed to solve the duplicated image problem! However, I did notice that this time it skipped a few images so once I had finished annotating I was left with 87 annotations in total (suggesting 9 images had been skipped). Any ideas as to what might be causing this now?


It's hard to know 100% for sure, but it might help to revisit the hashing mechanism. I drew a simplified version of it in the diagram below.

The thinking is that we don't want to see examples that we've already seen before so we compare the hash values in the database with the hash values of the examples in the new stream.

So it could be that your dataset has some sort of a duplicate. Either there's an example with the same hash value in the database or in the same stream of examples on disk. There are some more details to this, because you can specify if you want to exclude by the task hash or by the input hash. But this is the gist of it.
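To make the idea concrete, here is a minimal sketch of that filtering step in plain Python. It is not Prodigy's actual implementation: `toy_hash` and `filter_seen` are hypothetical helpers that only imitate the behaviour described above, and the example tasks are made up.

```python
import hashlib
import json


def toy_hash(example, keys):
    """Hash only the selected keys of a task (a stand-in, not Prodigy's real hash)."""
    payload = json.dumps({k: example.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def filter_seen(stream, seen_hashes, keys=("image",)):
    """Yield tasks whose hash is neither in the database nor earlier in the stream."""
    seen = set(seen_hashes)
    for eg in stream:
        h = toy_hash(eg, keys)
        if h in seen:
            continue  # already annotated, or a duplicate within this stream
        seen.add(h)
        yield eg


# Made-up tasks: a_copy.png has the same image data as a.png
stream = [
    {"image": "data:image/png;base64,AAAA", "path": "images/a.png"},
    {"image": "data:image/png;base64,BBBB", "path": "images/b.png"},
    {"image": "data:image/png;base64,AAAA", "path": "images/a_copy.png"},
]
# Pretend b.png was annotated in an earlier session
already_annotated = {toy_hash({"image": "data:image/png;base64,BBBB"}, ("image",))}

kept = list(filter_seen(stream, already_annotated))
print([eg["path"] for eg in kept])  # ['images/a.png']
```

Because only the "image" key goes into the hash, two files with identical image data collapse into a single task even though their paths differ. That would explain ending up with fewer annotations than files in the folder.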

It's very hard to know for sure without having access to your dataset, but this is my gut feeling at this point. If you feel like there's a bug though; do let me know! I'll gladly look at a minimum viable example to make sure it's working as expected.

Thanks again for your response @koaning!

After looking at my dataset, there were in fact duplicated images. Because I only used "image" to set the hashes, Prodigy was assigning these images the same hash and removing the duplicates to prevent multiple annotations. The custom recipe is working perfectly now, thank you for all your help! :smile:
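For anyone who wants to run the same check on their own folder, here is a rough sketch that groups files by a hash of their raw bytes. `find_duplicate_files` is a hypothetical helper, not part of Prodigy, and it only catches byte-identical copies; the same picture saved once as PNG and once as JPEG will not be grouped together.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(folder):
    """Return lists of file names in `folder` whose contents are byte-identical."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            # Hash the raw bytes so identical files get the same digest
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    # Keep only digests shared by more than one file
    return [names for names in groups.values() if len(names) > 1]
```

Calling `find_duplicate_files("./images")` returns one list of file names per group of identical files, so an empty result means no byte-level duplicates were found.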


Happy to hear it!

Feel free to ask again if you happen to hit your head somewhere.