Image classification (choice) - Duplicated images

Thanks for opening this as a separate thread – definitely good to keep the threads focused! :+1:

This is definitely strange – it seems like tasks with the same input somehow receive different task hashes over different runs? The _input_hash is based on the value of "image", while the _task_hash takes the input hash, plus the "spans", "label", and "options" properties into account, if available.
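To make the distinction concrete, here is a toy sketch of the idea. It does not use Prodigy's actual hashing algorithm: `toy_hash`, `INPUT_KEYS` and `TASK_KEYS` are made-up names for illustration, and the hash values will not match what Prodigy produces. It only shows how two tasks with the same image can end up with the same input hash but different task hashes.

```python
import hashlib
import json


def toy_hash(task, keys):
    """Hash only the given keys of a task (illustrative, NOT Prodigy's algorithm)."""
    subset = {key: task[key] for key in keys if key in task}
    payload = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


# Assumed key sets, mirroring the description above
INPUT_KEYS = ["image"]
TASK_KEYS = ["image", "spans", "label", "options"]

# Same image, but the option text differs in capitalisation only
task_a = {"image": "cat.jpg", "options": [{"id": 1, "text": "Cat"}]}
task_b = {"image": "cat.jpg", "options": [{"id": 1, "text": "cat"}]}

same_input = toy_hash(task_a, INPUT_KEYS) == toy_hash(task_b, INPUT_KEYS)
same_task = toy_hash(task_a, TASK_KEYS) == toy_hash(task_b, TASK_KEYS)
print(same_input, same_task)
```

Here the input hashes match, but the task hashes don't, so the two tasks would be treated as different questions about the same image.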

Is there anything in your options that could possibly change between sessions? Like, when you unpickle the file with the options or something like that? Even a tiny difference would cause the task to receive the same input hash (because same image), but a different task hash – which makes Prodigy think they’re different questions.
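One way to check whether this is what's happening is to look at the annotations you've already collected and find inputs that ended up with more than one task hash. The following is a rough sketch that works on a plain list of dicts with "_input_hash" and "_task_hash" keys (for instance, examples exported with `prodigy db-out`). The helper name `find_conflicting_inputs` and the sample data are made up for illustration.

```python
from collections import defaultdict


def find_conflicting_inputs(examples):
    """Return {input_hash: set of task hashes} for inputs with more than one task hash."""
    by_input = defaultdict(set)
    for eg in examples:
        by_input[eg["_input_hash"]].add(eg["_task_hash"])
    return {
        input_hash: task_hashes
        for input_hash, task_hashes in by_input.items()
        if len(task_hashes) > 1
    }


# Made-up sample data: cat.jpg was annotated twice under different task hashes
examples = [
    {"image": "cat.jpg", "_input_hash": 111, "_task_hash": 900},
    {"image": "cat.jpg", "_input_hash": 111, "_task_hash": 901},
    {"image": "dog.jpg", "_input_hash": 222, "_task_hash": 950},
]

conflicts = find_conflicting_inputs(examples)
print(conflicts)
```

If this turns up conflicts, comparing the "options" of two examples that share an input hash should show what changed between sessions.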

If you know that you’re only ever going to ask one question about one image, you could also set your own hashes and base both the input hash and task hash on the value of "image", which shouldn’t change. Prodigy will accept pre-defined hashes that are already set in the stream. For example:

import prodigy

for task in stream:
    # base both hashes on the image only, so they stay stable across sessions
    task = prodigy.set_hashes(task, input_keys=["image"], task_keys=["image"])
    # and so on