Dataset not automatically excluded

For a custom image labeling recipe, we are loading images from Google Cloud via expiring links. I have therefore set up a custom input hash based on the storage blob ID, which is more permanent.

The only problem is that when the Prodigy server is restarted, already annotated examples are not excluded. I have double-checked that the input hashes for the same images are identical, but I am able to label them and add them to the example SQL table more than once if I restart the Prodigy server.

I am somewhat confused about the hashing: the documentation says that task_hash is used for exclusion, while some forum posts indicate that the input_hash is used.

Our recipe:

import prodigy
from prodigy.util import split_string

@prodigy.recipe('kanda_object_detection_gcloud',
    dataset=("The dataset to use.", "positional", None, str),
    bucket_name=("Name of the cloud bucket containing images.", "positional", None, str),
    label=("One or more comma-separated labels", "positional", None, split_string)
)
def kanda_object_detection_gcloud(dataset, bucket_name, label):
    """
    Manually annotate images by drawing rectangular bounding boxes on the image. Loads images via Google Cloud Storage bucket.
    """
    stream = google_bucket_image_loader(bucket_name, dataset)

    return {
        'view_id': 'image_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': [dataset],
        'config': {
            'label': ', '.join(label),
            'labels': label,
            'darken_image': 0.1,
            "ner_manual_label_style": "dropdown"
        }
    }

Our stream:

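from datetime import timedelta

import mmh3
from google.auth import compute_engine
from google.auth.transport import requests  # assumed source of requests.Request() below
from google.cloud import storage

def google_bucket_image_loader(bucket_name, dataset):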
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=dataset)

    for page in blobs.pages:
        for blob in page:
            # TODO: Should this implement a custom exclusion function?

            expiration_time = timedelta(hours=1)
            signing_credentials = compute_engine.IDTokenCredentials(requests.Request(), "")
            image_url = blob.generate_signed_url(expiration=expiration_time, credentials=signing_credentials, version="v4")

            task = {'image': image_url, "meta": {'blob id': blob.id, 'bucket': bucket_name, 'prefix': dataset}}
            task['_input_hash'] = mmh3.hash(blob.id)

            # task = set_hashes(task, input_keys=('blob id'), overwrite=True)

            yield task

What am I missing here?

By default, the task hash will be used for exclusion. You can think of the two hashes like this: the input hash represents the raw input, e.g. the image. The task hash represents the question that’s being asked about that input – for instance, if you’re doing image classification, you might ask several questions about the same image, one question per label. Similarly, in binary NER annotation, you can give feedback on several entity suggestions in the same text, and then merge those later on.
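
As a rough illustration of that distinction, here is a minimal sketch that does not use Prodigy itself. The helpers `stable_hash` and `add_hashes` are hypothetical stand-ins, and the choice of `image` as the input key and `label` as the task key is an assumption made for the example; Prodigy's own hashing may differ in detail.

```python
import hashlib
import json


def stable_hash(value) -> int:
    """Deterministic 32-bit hash of a JSON-serializable value (stand-in, not Prodigy's hashing)."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return int.from_bytes(hashlib.sha1(payload).digest()[:4], "big", signed=True)


def add_hashes(task: dict, input_keys=("image",), task_keys=("label",)) -> dict:
    """Hypothetical helper: input hash covers the raw input, task hash covers input plus question."""
    task = dict(task)
    task["_input_hash"] = stable_hash([task.get(k) for k in input_keys])
    task["_task_hash"] = stable_hash([task["_input_hash"]] + [task.get(k) for k in task_keys])
    return task


cat = add_hashes({"image": "photo_1.jpg", "label": "CAT"})
dog = add_hashes({"image": "photo_1.jpg", "label": "DOG"})
cat_again = add_hashes({"image": "photo_1.jpg", "label": "CAT"})

print(cat["_input_hash"] == dog["_input_hash"])      # same image, same input hash
print(cat["_task_hash"] == dog["_task_hash"])        # different question about that image
print(cat["_task_hash"] == cat_again["_task_hash"])  # identical question
```

The two label questions share an input hash because they point at the same image, while only the repeated CAT question shares a task hash.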

The input / task hash mechanism lets Prodigy (and you) distinguish between identical questions (e.g. to not ask a question twice) and different questions on the same input (e.g. to merge them later on, identify conflicts, etc.).
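
The two uses can be sketched in plain Python as follows. This is only an illustration of the idea, not how Prodigy implements it internally; `exclude_seen`, `group_by_input`, and the hash values are made up for the example.

```python
from collections import defaultdict


def exclude_seen(stream, seen_task_hashes):
    """Skip tasks whose task hash was already annotated (the same question asked before)."""
    for task in stream:
        if task["_task_hash"] not in seen_task_hashes:
            yield task


def group_by_input(examples):
    """Collect all answers about the same input, e.g. to merge them or look for conflicts."""
    groups = defaultdict(list)
    for eg in examples:
        groups[eg["_input_hash"]].append(eg)
    return dict(groups)


annotated = [
    {"_input_hash": 1, "_task_hash": 10, "label": "CAT", "answer": "accept"},
    {"_input_hash": 1, "_task_hash": 11, "label": "DOG", "answer": "reject"},
]
incoming = [
    {"_input_hash": 1, "_task_hash": 10, "label": "CAT"},   # already answered
    {"_input_hash": 1, "_task_hash": 12, "label": "BIRD"},  # new question, same image
    {"_input_hash": 2, "_task_hash": 20, "label": "CAT"},   # new image
]

seen = {eg["_task_hash"] for eg in annotated}
remaining = list(exclude_seen(incoming, seen))
print([t["_task_hash"] for t in remaining])
print({k: len(v) for k, v in group_by_input(annotated).items()})
```

Only the already answered CAT question is dropped from the incoming stream, and both existing annotations end up in the same group because they refer to the same input.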

Are the task hashes identical, too?
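
One way to check is to build the task for the same blob twice, as it would happen across two server runs, and compare both hashes. The sketch below also pins both hashes to the permanent blob ID. It is only an assumption-laden illustration: `hash_from_id` and `make_task` are hypothetical helpers, `hashlib` stands in for `mmh3` so the snippet has no dependencies, the URLs and blob ID are made up, and whether pinning the task hash resolves the problem depends on how your Prodigy version assigns and compares hashes.

```python
import hashlib


def hash_from_id(identifier: str) -> int:
    """Deterministic 32-bit hash of a string (hashlib used here instead of mmh3)."""
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


def make_task(blob_id: str, signed_url: str) -> dict:
    """Hypothetical task builder: both hashes depend only on the permanent blob ID."""
    task = {"image": signed_url, "meta": {"blob id": blob_id}}
    task["_input_hash"] = hash_from_id(blob_id)
    # Assumes one question per image, so the task hash can be derived from the same ID.
    task["_task_hash"] = hash_from_id("task:" + blob_id)
    return task


# Same blob, two server runs, different expiring URLs (made-up values).
run_1 = make_task("bucket/images/001.jpg/1", "https://example.com/001.jpg?sig=aaa")
run_2 = make_task("bucket/images/001.jpg/1", "https://example.com/001.jpg?sig=bbb")

for key in ("_input_hash", "_task_hash"):
    print(key, run_1[key] == run_2[key])
```

If the task hashes in your stream turn out to differ between restarts while the input hashes match, that would explain why the examples are not excluded.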