Image classification (choice) - Duplicated images

Hi,

I’ve been annotating images with a choice interface using the following recipe:

import pickle
import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe('image-choice')
def image_choice(dataset, source):
    stream = Images(source)
    options_file = 'options.p'  # Dictionary with my custom options
    stream = add_options_image(stream, options_file)

    return {
        'dataset': dataset,
        'stream': stream,
        'view_id': 'choice',
        'config': {'choice_style': 'multiple', 'show_stats': True},
    }

def add_options_image(stream, options_file):
    with open(options_file, 'rb') as fp:
        options = pickle.load(fp)

    for task in stream:
        task['options'] = options
        yield task

The image directory I reference when calling the recipe has about 2,000 JPEG files.

The app works and I get to annotate the images exactly as I wanted, except for one problem: repeated images when I relaunch the app.

After doing around 300 annotations, I closed the app to export the database, and repeated this process two or three times. Every time I launch the app, the first image Prodigy suggests is always the same.

I looked into my database export: the meta.file field shows 24 duplicates (same file), and the same goes for _input_hash – 24 duplicates. The image field counts 59 duplicates, but this might be because different files contain the same image. _task_hash shows 0 duplicates.
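For anyone wanting to reproduce this kind of check: here's a small, hypothetical sketch of how to count duplicates in a JSONL export using only the standard library (the hash values and file names below are made up for illustration):

```python
import json
from collections import Counter

# A few lines as they might appear in a Prodigy JSONL export
# (hypothetical values, just for illustration)
lines = [
    '{"_input_hash": 111, "_task_hash": 901, "meta": {"file": "a.jpg"}}',
    '{"_input_hash": 111, "_task_hash": 902, "meta": {"file": "a.jpg"}}',
    '{"_input_hash": 222, "_task_hash": 903, "meta": {"file": "b.jpg"}}',
]
examples = [json.loads(line) for line in lines]

def count_duplicates(examples, key):
    """Count how many examples are surplus copies under the given key."""
    counts = Counter(key(eg) for eg in examples)
    return sum(n - 1 for n in counts.values() if n > 1)

print(count_duplicates(examples, lambda eg: eg["_input_hash"]))   # 1
print(count_duplicates(examples, lambda eg: eg["meta"]["file"]))  # 1
print(count_duplicates(examples, lambda eg: eg["_task_hash"]))    # 0
```

The same pattern applies to a real export: read the file line by line, parse each line with `json.loads` and feed the key you care about into the counter.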

How can I prevent Prodigy from asking me to annotate a file which has been annotated before?

PS: I chose to create a new issue because it only came up today and is different from yesterday’s.

Thanks for opening this as a separate thread – definitely good to keep the threads focused! :+1:

This is definitely strange – it seems like tasks with the same input somehow receive different task hashes over different runs? The _input_hash is based on the value of "image", while the _task_hash takes the input hash, plus the "spans", "label", and "options" properties into account, if available.

Is there anything in your options that could possibly change between sessions? Like, when you unpickle the file with the options or something like that? Even a tiny difference would cause the task to receive the same input hash (because same image), but a different task hash – which makes Prodigy think they’re different questions.
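To make the effect concrete, here's a rough simulation of the hashing idea (not Prodigy's actual implementation – the real hashes are computed differently): the input hash covers only the input fields, while the task hash also covers annotation-relevant fields like "options".

```python
import hashlib
import json

def make_hash(task, keys):
    # Serialize only the selected keys deterministically, then hash
    data = json.dumps({k: task.get(k) for k in keys}, sort_keys=True)
    return hashlib.md5(data.encode("utf8")).hexdigest()

# Same image, but the options changed between sessions
task_v1 = {"image": "photo_001.jpg", "options": [{"id": "CAT"}]}
task_v2 = {"image": "photo_001.jpg", "options": [{"id": "CAT"}, {"id": "DOG"}]}

input_keys = ["image"]
task_keys = ["image", "options"]

# Same image -> same input hash across sessions
print(make_hash(task_v1, input_keys) == make_hash(task_v2, input_keys))  # True
# Different options -> different task hash, so it looks like a new question
print(make_hash(task_v1, task_keys) == make_hash(task_v2, task_keys))    # False
```

This is exactly the situation where duplicates by input hash can coexist with zero duplicates by task hash.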

If you know that you’re only ever going to ask one question about one image, you could also set your own hashes and base both the input hash and task hash on the value of "image", which shouldn’t change. Prodigy will accept pre-defined hashes that are already set in the stream. For example:

for task in stream:
    task = prodigy.set_hashes(task, input_keys=["image"], task_keys=["image"])
    # and so on

Thanks Ines! Indeed I was adding new choice options every time. This is quite a normal procedure for me, I guess it could be useful for you to understand my use case:

  1. Do a few annotations with the first iteration of options. If none of the options fits the image, I hit Ignore and write down on paper the new option to add to my list
  2. I export the annotations, check the stats on the different labels
  3. From the missing options I took a note of, I add to the recipe the ones which make more sense
  4. I start Prodigy again, annotating more images, now with more options. However, I don’t want to go back and annotate images which I have previously annotated (even the Ignored ones, because I have many more images and can afford to ignore a bunch).

When I am starting a new dataset/problem, I usually go through these steps for a few iterations until I have a stable set of options.

That being said, I found a workaround where I move already annotated images to another dataset in step 2, but your solution of basing the hash on the input only should work as well.

I never noticed this issue when annotating text because I usually work with datasets with many thousands of documents and I stream random batches of documents each time.

Ah, cool – glad you figured it out! And thanks for sharing your workflow :blush:

From what you describe, it sounds like you might actually want to write your own logic that filters and excludes examples based on their input hashes. This gives you more control over how to handle duplicates at different stages of your workflow.

Custom recipes can return an on_load function that’s called when the recipe loads. It gives you access to the controller and database, so you can fetch all input hashes of the dataset when your recipe loads. In your stream, you can then check the incoming examples and only send them out if their input hash hasn’t been seen before. You could even apply more fine-grained logic here – like, only send it out if it has been seen before and meets other conditions (e.g. if you set a custom --reannotate flag on your recipe, or if the example has a certain property and so on).

This example recipe shows the use of the on_load method to get data from the database. Here, it’s only keeping counts of the answer types – but you could use the same logic to keep a set of the seen hashes.
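Here's a minimal sketch of the filtering idea behind that approach. In a real recipe you'd fill `seen_hashes` from the database inside the on_load callback; here it's a plain set so the example runs stand-alone, and the `--reannotate` flag is the hypothetical one mentioned above:

```python
# Input hashes already in the dataset – in a real recipe this would come
# from the database when the recipe loads (e.g. via an on_load callback)
seen_hashes = {111, 333}

def filter_seen_inputs(stream, seen_hashes, reannotate=False):
    for eg in stream:
        already_seen = eg["_input_hash"] in seen_hashes
        # By default only send out unseen examples; with a custom
        # --reannotate flag, send out only the previously seen ones instead
        if already_seen == reannotate:
            yield eg

stream = [
    {"_input_hash": 111, "image": "a.jpg"},
    {"_input_hash": 222, "image": "b.jpg"},
]
print([eg["image"] for eg in filter_seen_inputs(stream, seen_hashes)])
# ['b.jpg']
```

The same generator pattern works for any other condition you want to attach to "seen before".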

Thanks Ines! I’ll definitely try this out


Hi Ines! I picked up this suggestion from you again. Could you give me a short working example of how to use the on_load function to filter the examples in the stream by input hash instead of task hash?

Just realised you can also do it without returning the on_load callback, so here’s a super minimal version that shows the idea:

from prodigy.components.filters import filter_inputs
from prodigy.components.db import connect

# In your recipe function
db = connect()
input_hashes = db.get_input_hashes(dataset)

stream = []  # your stream here
stream = filter_inputs(stream, input_hashes)

Internally, all the filter_inputs helper really does is something like this:

def filter_inputs(stream, input_hashes):
    for eg in stream:
        if eg["_input_hash"] not in input_hashes:
            yield eg

That’s the underlying logic for filtering examples, so you can also write your own function and implement something custom.

Thanks Ines!

Two questions:

  1. When I don’t add this to my recipe, at which stage is the filter by _task_hash done?
  2. If I applied this same logic to an NLP job, would it be possible to filter by a unique identifier I have in the meta field of each example? (using the JSONL loader)

Thanks!

In the “controller”, so after your recipe function was executed and has returned its components, and before Prodigy starts up the annotation server.

Yes, absolutely. The entire task dictionary will be saved in the database, and you can get all existing annotations for a given dataset in the database. Let’s say your examples look like this:

{"text": "Hello world", "meta": {"id": 123}}
{"text": "Blah blah", "meta": {"id": 456}}

When you annotate them, they’ll be saved to the dataset. In your recipe, you can then call db.get_dataset to load them and get the meta.id field from each example. You now have a list of values that you can compare the incoming examples against.

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset(dataset)
# Collect the meta.id values as a set for fast membership checks
meta_ids = {eg["meta"]["id"] for eg in examples}

def filter_stream(stream):
    for eg in stream:
        if eg["meta"]["id"] not in meta_ids:
            yield eg

If you can express it in Python, you can add pretty much any conditional logic here. It’s probably not very useful, but you could even send an example out if its text is longer than X characters, or if it was annotated before but rejected and its ID is Y and some other custom meta property is Z. Or you could send a certain example out only if today is Monday or Tuesday :sweat_smile: