Dataset not automatically excluded

danielrothmann · June 5, 2019, 7:22am

For a custom image labeling recipe, we are loading images from Google Cloud via expiring links. I have therefore set up a custom input hash based on the storage blob ID, which is more permanent.

The only problem is that when the Prodigy server is restarted, already annotated examples are not excluded. I have double-checked that the input hashes for the same images are identical, but I am able to label them and add them to the example SQL table more than once if I restart the Prodigy server.

I am a bit confused about the hashing: the documentation says that task_hash is used for exclusion, while some forum posts indicate that input_hash is used.

Our recipe:

@prodigy.recipe('kanda_object_detection_gcloud',
    dataset=("The dataset to use.", "positional", None, str),
    bucket_name=("Name of the cloud bucket containing images.", "positional", None, str),
    label=("One or more comma-separated labels",  "positional", None, split_string)
)
def kanda_object_detection_gcloud(dataset, bucket_name, label):
    """
    Manually annotate images by drawing rectangular bounding boxes on the image. Loads images via Google Cloud Storage bucket.
    """
    stream = google_bucket_image_loader(bucket_name, dataset)

    return {
        'view_id': 'image_manual',
        'dataset': dataset,
        'stream': stream,
        'exclude': [dataset],
        'config': {
            'label': ', '.join(label),
            'labels': label,
            'darken_image': 0.1,
            "ner_manual_label_style": "dropdown"
        }
    }

Our stream:

def google_bucket_image_loader(bucket_name, dataset):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=dataset)

    for page in blobs.pages:
        for blob in page:
            # TODO: Should this implement a custom exclusion function?

            expiration_time = timedelta(hours=1)
            signing_credentials = compute_engine.IDTokenCredentials(requests.Request(), "")
            image_url = blob.generate_signed_url(expiration=expiration_time, credentials=signing_credentials, version="v4")

            task = {'image': image_url, "meta": {'blob id': blob.id, 'bucket': bucket_name, 'prefix': dataset}}
            task['_input_hash'] = mmh3.hash(blob.id)

            # task = set_hashes(task, input_keys=('blob id'), overwrite=True)

            yield task

What am I missing here?

ines · June 5, 2019, 2:01pm

By default, the task hash will be used for exclusion. You can think of the two hashes like this: the input hash represents the raw input, e.g. the image. The task hash represents the question that's being asked about that input – for instance, if you're doing image classification, you might ask several questions about the same image, one question per label. Similarly, in binary NER annotation, you can give feedback on several entity suggestions in the same text, and then merge those later on.

The input / task hash mechanism lets Prodigy (and you) distinguish between identical questions (e.g. to not ask a question twice) and different questions on the same input (e.g. to merge them later on, identify conflicts etc.).

Are the task hashes identical, too?
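
To make the distinction concrete, here is a minimal sketch of the idea in plain Python. It is not Prodigy's actual hashing code: `stable_hash`, `make_task` and `filter_seen` are hypothetical helpers, `hashlib` stands in for `mmh3`, and the blob IDs and URLs are made up. The assumption it illustrates is that if a hash is derived from anything containing the expiring signed URL, it will change on every restart, whereas hashes derived only from the blob ID (and the labels being asked about) stay the same.

```python
import hashlib


def stable_hash(value: str) -> int:
    """Deterministic signed 32-bit hash of a string (stand-in for mmh3.hash)."""
    digest = hashlib.md5(value.encode("utf8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)


def make_task(blob_id: str, signed_url: str, labels: list) -> dict:
    """Build a task whose hashes depend only on stable values, never on the URL."""
    task = {"image": signed_url, "meta": {"blob id": blob_id}}
    # Input hash: identifies the raw input (the image), via its permanent blob ID.
    task["_input_hash"] = stable_hash(blob_id)
    # Task hash: identifies the question asked about that input (here: the label set).
    question = f"{task['_input_hash']}|{','.join(sorted(labels))}"
    task["_task_hash"] = stable_hash(question)
    return task


def filter_seen(stream, seen_task_hashes: set):
    """Yield only the tasks whose task hash has not been annotated yet."""
    for task in stream:
        if task["_task_hash"] not in seen_task_hashes:
            yield task


# Same blob, but a freshly signed URL after a (simulated) server restart.
first = make_task("my-bucket/img1.jpg/1559", "https://example.com/img1?sig=aaa", ["CAR"])
second = make_task("my-bucket/img1.jpg/1559", "https://example.com/img1?sig=bbb", ["CAR"])

already_annotated = {first["_task_hash"]}
remaining = list(filter_seen([second], already_annotated))
```

With both hashes tied to the blob ID, `second` is recognised as the same question as `first` even though its `image` URL differs, so `remaining` is empty. In a real recipe you would compare against the task hashes stored in your dataset rather than a hand-built set.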
