For a custom image labeling recipe, we load images from Google Cloud via expiring signed links, so I have set up a custom input hash based on the storage blob ID, which is permanent.
The only problem is that when the Prodigy server is restarted, already-annotated examples are not excluded. I have double-checked that the input hashes for the same images are identical, but if I restart the Prodigy server I can label the same images again and add them to the example SQL table more than once.
I am also confused about the hashing: the documentation says that the task_hash is used for exclusion, while some forum posts indicate that the input_hash is used.
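For what it's worth, here is the config tweak I was considering (a sketch based on my reading of the docs, possibly wrong; the label values are placeholders):

```python
# Hypothetical tweak: if exclusion defaults to the task hash, switching it
# to the input hash via "exclude_by" might make the blob-ID-based
# _input_hash take effect. "CAT"/"DOG" are placeholder labels.
config = {
    "label": "CAT, DOG",
    "labels": ["CAT", "DOG"],
    "darken_image": 0.1,
    "exclude_by": "input",  # exclude previously seen examples by _input_hash
}
```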
Our recipe:
import prodigy
from prodigy.util import split_string

@prodigy.recipe(
    "kanda_object_detection_gcloud",
    dataset=("The dataset to use.", "positional", None, str),
    bucket_name=("Name of the cloud bucket containing images.", "positional", None, str),
    label=("One or more comma-separated labels", "positional", None, split_string),
)
def kanda_object_detection_gcloud(dataset, bucket_name, label):
    """
    Manually annotate images by drawing rectangular bounding boxes on the
    image. Loads images via a Google Cloud Storage bucket.
    """
    stream = google_bucket_image_loader(bucket_name, dataset)
    return {
        "view_id": "image_manual",
        "dataset": dataset,
        "stream": stream,
        "exclude": [dataset],
        "config": {
            "label": ", ".join(label),
            "labels": label,
            "darken_image": 0.1,
            "ner_manual_label_style": "dropdown",
        },
    }
Our stream:
from datetime import timedelta

import mmh3
from google.auth import compute_engine
from google.auth.transport import requests
from google.cloud import storage

def google_bucket_image_loader(bucket_name, dataset):
    client = storage.Client()
    bucket = client.get_bucket(bucket_name)
    blobs = bucket.list_blobs(prefix=dataset)
    for page in blobs.pages:
        for blob in page:
            # TODO: Should this implement a custom exclusion function?
            expiration_time = timedelta(hours=1)
            signing_credentials = compute_engine.IDTokenCredentials(requests.Request(), "")
            image_url = blob.generate_signed_url(
                expiration=expiration_time, credentials=signing_credentials, version="v4"
            )
            task = {
                "image": image_url,
                "meta": {"blob id": blob.id, "bucket": bucket_name, "prefix": dataset},
            }
            task["_input_hash"] = mmh3.hash(blob.id)
            # task = set_hashes(task, input_keys=("blob id",), overwrite=True)
            yield task
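To illustrate what I suspect is going on, here is a minimal stdlib-only sketch of why a hash derived from the signed URL changes on every restart while one derived from the blob ID stays stable (the URLs and blob ID below are hypothetical placeholders, and `stable_hash` is just a stand-in for `mmh3.hash`):

```python
# Sketch: the signed URL carries a fresh signature after every restart,
# so any hash computed over the URL changes; a hash over the blob ID
# does not. All values below are made-up placeholders.
import hashlib

def stable_hash(value: str) -> str:
    """Deterministic hash of a string (stand-in for mmh3.hash)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()

blob_id = "my-bucket/images/cat.jpg/1234567890"

# Each server restart generates a fresh signed URL with a new signature:
url_restart_1 = f"https://storage.googleapis.com/{blob_id}?X-Goog-Signature=aaa"
url_restart_2 = f"https://storage.googleapis.com/{blob_id}?X-Goog-Signature=bbb"

print(stable_hash(url_restart_1) == stable_hash(url_restart_2))  # False
print(stable_hash(blob_id) == stable_hash(blob_id))              # True
```

If exclusion is keyed on a hash that covers the image URL, this would explain why the same images come back after a restart even though my custom _input_hash is identical.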
What am I missing here?