Logic behind hash keys (in relation to REVIEW API)

Hello Prodigy Team! I am writing a custom review recipe, and trying to understand the thought process behind defining default hash keys.

When hashing tasks just by calling set_hashes, the following default key sets are used:

```python
from prodigy.util import IGNORE_HASH_KEYS, INPUT_HASH_KEYS, TASK_HASH_KEYS
print("INPUT_HASH_KEYS:", INPUT_HASH_KEYS)
print("TASK_HASH_KEYS:", TASK_HASH_KEYS)
print("IGNORE_HASH_KEYS:", IGNORE_HASH_KEYS)
```

which prints:

```
INPUT_HASH_KEYS: ('text', 'image', 'html', 'input')
TASK_HASH_KEYS: ('spans', 'label', 'options', 'arcs')
IGNORE_HASH_KEYS: ('score', 'rank', 'model', 'source', 'pattern', 'priority', 'path', '_view_id', '_session_id', '_annotator_id', 'answer')
```
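For example, calling set_hashes with its defaults on a simple task (the text here is made up) adds both hashes:

```python
from prodigy import set_hashes

# "text" is an INPUT_HASH key, "spans" is a TASK_HASH key
task = {
    "text": "Apple is looking at buying a U.K. startup.",
    "spans": [{"start": 0, "end": 5, "label": "ORG"}],
}
task = set_hashes(task)
print(task["_input_hash"], task["_task_hash"])
```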

It seems that the initial idea was to hash the input in _input_hash and what needs to be done in _task_hash. Is that right?
Any reason relations is not in the TASK_HASH_KEYS by the way?

Now, when I look at the review recipe, the hash keys become:

```python
INPUT_KEYS = ("text", "image", "html", "options", "audio", "video")
TASK_KEYS = ("spans", "label", "accept", "audio_spans", "relations")
```

options migrated to the input keys, and accept and relations were added to the task keys.
So now it seems that _task_hash's meaning changes from "what needed to be done" to "what was actually done"?

I have no issues with changing the task hash based on user actions; however, when we change the input hash keys, we lose the ability to link to the original documents that were reviewed... What are your thoughts on that? I guess we can re-hash the reviewed result using the original keys in before_db?

Another question/observation: when we include ignored and rejected answers in the review, I imagine we want to see different answers separately, no? answer is not in TASK_KEYS (and is also in IGNORE_HASH_KEYS).

Now that I've thought about it a bit more, it seems that a reasonable solution would be to recalculate _task_hash only in the review recipe. I saw a mention of a get_hash function in one of the threads on this forum. It is not documented; is it safe to use (i.e. it won't disappear? :grinning:)?

Hi @nrodnova,

You're right in thinking that the INPUT_HASH is meant to group examples with the same annotation inputs, i.e. what the annotators saw, while the TASK_HASH is meant to group identical annotations, i.e. what the annotators did. I'd like to stress, though, that TASK_HASH is always about "what was actually done". If the values of these keys are empty or non-existent, they have no effect on the hash value.
The purpose of having both INPUT_HASH and TASK_HASH is to be able to distinguish between the different questions (TASK_HASH) about the same input (INPUT_HASH) and merge/filter examples accordingly.
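To make that concrete, here's a quick sketch (the text is made up): two different annotations of the same input share the _input_hash but get different _task_hashes:

```python
from prodigy import set_hashes

text = "Berlin is in Germany."
a = set_hashes({"text": text, "spans": [{"start": 0, "end": 6, "label": "GPE"}]})
b = set_hashes({"text": text, "spans": [{"start": 13, "end": 20, "label": "GPE"}]})

assert a["_input_hash"] == b["_input_hash"]  # same input: what the annotators saw
assert a["_task_hash"] != b["_task_hash"]    # different annotations: what they did
```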

The reason relations is not included in the default TASK_HASH keys is mostly that Prodigy and spaCy don't provide a built-in trainable component for relations, so technically there's no use for it inside the library except for the review workflow, where it is included, as you've noticed. With hindsight, though, it should probably be included to represent the annotation accurately. We'll take that into consideration, for sure.

The review recipe uses two different ways to select examples for review, depending on whether the annotations are interpreted as manual or binary.
For manual annotations, the examples are grouped by input. In other words, we want to consider all annotations with the same INPUT_HASH as different versions.
Conversely, for binary annotations, where the annotators only provide accept/reject decisions about the same input, the annotations are grouped by task. In other words, different answers on the exact same annotation (TASK_HASH) are considered different versions.
With that in mind, we want to treat annotations done via the choice interface as manual annotations (= grouped by input), and that's the reason we include options among the INPUT_HASH keys and accept as a TASK_HASH key.
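To illustrate the two grouping strategies (this is just a sketch, not the actual review internals):

```python
from collections import defaultdict

def group_versions(examples, manual=True):
    # Illustrative only. Manual annotations are grouped by _input_hash:
    # same input, different annotations = different versions. Binary
    # annotations are grouped by _task_hash: different answers to the
    # exact same annotation = different versions.
    key = "_input_hash" if manual else "_task_hash"
    groups = defaultdict(list)
    for eg in examples:
        groups[eg[key]].append(eg)
    return groups
```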
Hope that makes sense?

> I have no issues with changing the task hash based on user actions; however, when we change the input hash keys, we lose the ability to link to the original documents that were reviewed... What are your thoughts on that? I guess we can re-hash the reviewed result using the original keys in before_db?

The original documents that were reviewed are stored under the versions key in the review recipe's output. Also, hashes are deterministic, so you can always recompute them using the keys you need, e.g. in before_db as you mention.
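For example, a before_db callback along these lines (just a sketch, using the default key tuples from prodigy.util) would restore the default hashing on the reviewed examples:

```python
from prodigy import set_hashes
from prodigy.util import INPUT_HASH_KEYS, TASK_HASH_KEYS

def before_db(examples):
    # Recompute both hashes with the default keys so the reviewed output
    # can be linked back to the original documents. overwrite=True replaces
    # the hashes computed by the review recipe.
    return [
        set_hashes(eg, input_keys=INPUT_HASH_KEYS, task_keys=TASK_HASH_KEYS,
                   overwrite=True)
        for eg in examples
    ]
```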

> Another question/observation: when we include ignored and rejected answers in the review, I imagine we want to see different answers separately, no? answer is not in TASK_KEYS (and is also in IGNORE_HASH_KEYS).

By default, review only keeps the accepted answers for manual annotations. The logic here is that there's nothing to compare if someone ignored or rejected an example, and these rejected/ignored examples can't be used for training, either.
The value of the answer field is considered when reviewing binary annotations. Here, ignored answers are excluded for the same reason of not providing any annotation.
If you would like to include all types of answers (accept, reject, ignore) in your custom review of manual annotations, then yes, I suppose they should be handled as separate "versions" of the annotation.
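One way to do that would be to include answer among the task keys when re-hashing. A sketch (assuming that keys listed in ignore take precedence, so answer is removed from the ignore list as well to be safe):

```python
from prodigy import set_hashes
from prodigy.util import IGNORE_HASH_KEYS, TASK_HASH_KEYS

# Treat the answer as part of the task identity, so accept/reject/ignore
# decisions on the same annotation hash to distinct "versions"
TASK_KEYS_WITH_ANSWER = TASK_HASH_KEYS + ("answer",)
# "answer" is in IGNORE_HASH_KEYS by default, so drop it from the ignore list
IGNORE_WITHOUT_ANSWER = tuple(k for k in IGNORE_HASH_KEYS if k != "answer")

def rehash_with_answer(examples):
    return [
        set_hashes(eg, task_keys=TASK_KEYS_WITH_ANSWER,
                   ignore=IGNORE_WITHOUT_ANSWER, overwrite=True)
        for eg in examples
    ]
```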

Finally, for recomputing hash values you can use the set_hashes helper.
The get_hashes database method is meant to get a "snapshot" of the hash values currently stored for the given dataset in the DB.
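For example, using the documented get_input_hashes/get_task_hashes database methods ("my_dataset" is a placeholder):

```python
from prodigy.components.db import connect

db = connect()  # connects using the settings in prodigy.json
# Snapshot of the hashes currently stored for a dataset
input_hashes = db.get_input_hashes("my_dataset")
task_hashes = db.get_task_hashes("my_dataset")
```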

Thank you @magdaaniol for the detailed explanation!
I've done quite a bit of work on the review part since I asked this question, and got myself in trouble :slight_smile: I added re-hashing of the annotated tasks in before_db for regular annotations (non-review), and it caused the controller (or the task router?) to feed annotators the same questions over and over, because it looks like that code uses _task_hash (as opposed to _input_hash) to determine whether a task has already been annotated. Any reason why _input_hash is not used for that purpose? Also, is there a way to change this somewhere in the controller?

Currently, _task_hash doesn't reflect the annotation. The tasks are re-hashed before the annotation (and, as I explained above, things break if the tasks are re-hashed after the annotation and before going to the database). So, say, if I want to quickly see how many disagreements between annotators I have in a dataset, I have to read the whole thing, re-hash it and only then get my answer, instead of just running a quick GROUP BY SQL query on the input and task hashes.
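Concretely, the kind of query I have in mind looks like this (a sketch against the default SQLite back-end; the table and column names are my assumption about its schema):

```python
import sqlite3

conn = sqlite3.connect("prodigy.db")  # typically ~/.prodigy/prodigy.db

# Count how many distinct task hashes each input has in a dataset:
# more than one would mean the annotators disagreed on that input.
query = """
SELECT e.input_hash, COUNT(DISTINCT e.task_hash) AS n_versions
FROM example e
JOIN link l ON l.example_id = e.id
JOIN dataset d ON d.id = l.dataset_id
WHERE d.name = ?
GROUP BY e.input_hash
HAVING n_versions > 1
"""
for input_hash, n_versions in conn.execute(query, ("my_dataset",)):
    print(input_hash, n_versions)
```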

Is there any solution for this, i.e. a way to make the controller/task router use _input_hash as the task identifier rather than _task_hash?

Hi @nrodnova,

In order to determine whether a question has already been asked, you need to take into account both the input and the kind of question being asked. You might ask your annotators to first annotate NER and then add textcat annotations to the same inputs; excluding based on the input alone would prevent them from ever seeing the NER-annotated examples in the textcat round. That's why examples are excluded by _task_hash by default.
That said, you can change this behaviour by setting "exclude_by": "input" in the prodigy.json configuration file.
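For example, in prodigy.json:

```json
{
  "exclude_by": "input"
}
```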

By default, Prodigy re-hashes the examples when reading a dataset, so it shouldn't be necessary to re-hash them before saving to the DB if the dataset is going to be read again by another recipe such as review.
As explained above, review has its own specific way of re-hashing tasks and inputs to make sure the right versions are collected, depending on whether the annotation was manual or binary.
However, if you want to leverage _task_hash in a custom function by querying the DB directly, then you'd have to take care of re-hashing the tasks yourself, yes.
Btw, have you seen Prodigy's built-in inter-annotator agreement recipes?