I have been working with the Prodigy database, and I'm confused about the difference between _input_hash and _task_hash, which are assigned to each task.
I couldn't find anything satisfactory in the documentation.
Can you please help?
Hi! You can read more about the hashing here: https://prodi.gy/docs/api-loaders#hashing
A simple example: you might be annotating data for text classification with two categories, so you make two passes over the data, one with LABEL_A and one with LABEL_B. Each time, you accept/reject whether the label applies. This means you end up with two examples for each text: one with LABEL_A and one with LABEL_B. Both examples will have the same input hash, because they're questions about the same input data, the text. But they will have different task hashes, because they're different questions.
{"text": "Text", "label": "LABEL_A", "_input_hash": 1, "_task_hash": 2}
{"text": "Text", "label": "LABEL_B", "_input_hash": 1, "_task_hash": 3}
{"text": "Other", "label": "LABEL_A", "_input_hash": 2, "_task_hash": 4}
Using the input hashes and task hashes, Prodigy (or you) can also figure out whether two annotations are on the same data and use this information to merge your examples later on. For example, data-to-spacy will group all annotations with the same input hash together, so you'll get one example annotated with all categories, entities, POS tags or whatever else you labelled.
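If you want to do that kind of merging yourself, a minimal sketch could look like the following. It only illustrates the idea of grouping by input hash and is not how data-to-spacy is implemented; the function name `merge_by_input_hash`, the `cats` output key, and the sample `answer` values are assumptions made up for this example.

```python
# Sample annotations in the shape shown above, with an assumed "answer"
# recording whether the label was accepted or rejected.
annotations = [
    {"text": "Text", "label": "LABEL_A", "answer": "accept", "_input_hash": 1, "_task_hash": 2},
    {"text": "Text", "label": "LABEL_B", "answer": "reject", "_input_hash": 1, "_task_hash": 3},
    {"text": "Other", "label": "LABEL_A", "answer": "accept", "_input_hash": 2, "_task_hash": 4},
]


def merge_by_input_hash(examples):
    """Collapse all annotations on the same input into one record per text."""
    merged = {}
    for eg in examples:
        # One entry per input hash; created the first time the input is seen.
        entry = merged.setdefault(eg["_input_hash"], {"text": eg["text"], "cats": {}})
        entry["cats"][eg["label"]] = eg["answer"] == "accept"
    return list(merged.values())


merged = merge_by_input_hash(annotations)
for record in merged:
    print(record)
```

Here the two annotations on "Text" end up in a single record carrying both labels, while "Other" stays on its own.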