Difference between input hash and task hash

Hi! You can read more about the hashing here: https://prodi.gy/docs/api-loaders#hashing

A simple example: you might be annotating data for text classification with two categories, so you make two passes over the data, one with LABEL_A and one with LABEL_B. In each pass, you accept or reject each example depending on whether the label applies. This means you end up with two examples for each text: one with LABEL_A and one with LABEL_B. Both examples will have the same input hash, because they're questions about the same input data (the text). But they will have different task hashes, because they're different questions about that input.

{"text": "Text", "label": "LABEL_A", "_input_hash": 1, "_task_hash": 2}
{"text": "Text", "label": "LABEL_B", "_input_hash": 1, "_task_hash": 3}
{"text": "Other", "label": "LABEL_A", "_input_hash": 2, "_task_hash": 4}

Using the input hashes and task hashes, Prodigy (or you) can also figure out whether two annotations are on the same data and use this information to merge your examples later on. For example, data-to-spacy will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.
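
If you want to do this kind of merging yourself, the idea can be sketched in plain Python as below. This is only an illustration of grouping by input hash, not what data-to-spacy does internally. The `merge_by_input_hash` function and the `cats` output format are made up for this example, and it assumes each annotation has an `answer` field set to "accept" or "reject".

```python
def merge_by_input_hash(examples):
    """Group annotations on the same input into one record per text."""
    merged = {}
    for eg in examples:
        # All annotations with the same input hash share one entry.
        entry = merged.setdefault(
            eg["_input_hash"], {"text": eg["text"], "cats": {}}
        )
        # Assumption: "answer" is "accept" if the label applies.
        entry["cats"][eg["label"]] = eg["answer"] == "accept"
    return list(merged.values())


annotations = [
    {"text": "Text", "label": "LABEL_A", "answer": "accept",
     "_input_hash": 1, "_task_hash": 2},
    {"text": "Text", "label": "LABEL_B", "answer": "reject",
     "_input_hash": 1, "_task_hash": 3},
    {"text": "Other", "label": "LABEL_A", "answer": "accept",
     "_input_hash": 2, "_task_hash": 4},
]

merged_examples = merge_by_input_hash(annotations)
for example in merged_examples:
    print(example)
```

The three annotations collapse into two records, one per unique text, and the record for "Text" carries the decisions for both labels.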
