set_hashes unexpected behaviour

I'm currently trying to understand the behaviour of set_hashes in prodigy.util.

I have two examples:

eg1 = {'html': 'aaaaa', 'options': [{'id': 'a', 'text': 'A'}, {'id': 'b', 'text': 'B'}]}
eg2 = {'html': 'bbbbb', 'options': [{'id': 'a', 'text': 'A'}, {'id': 'b', 'text': 'B'}]}

In this case, I would expect both hashes to be equal:

task_hash1 = set_hashes(eg1, task_keys=('options',))['_task_hash']  # 1438234412
task_hash2 = set_hashes(eg2, task_keys=('options',))['_task_hash']  # -158612798

However, this is not the case...

Am I missing something here?

Hi! The task hash is based on the input hash (representing the original input) and the task keys (representing the question asked about the input). The html key is one of the keys used for the input hash, and since its value differs between your two examples, the input hashes differ, so you end up with different task hashes as well. This is typically the desired result, because you do want two different texts with the same options to receive different task hashes and be considered different.
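To make the dependency concrete, here is a toy model of the idea, not Prodigy's actual implementation. It assumes SHA-256 via hashlib instead of whatever hash function Prodigy uses, and the names toy_hash, toy_set_hashes and DEFAULT_INPUT_KEYS are made up for this sketch. It only illustrates that the task hash is seeded with the input hash:

```python
import hashlib
import json

# Hypothetical defaults for this sketch; Prodigy's real default key list may differ.
DEFAULT_INPUT_KEYS = ("text", "image", "html")


def toy_hash(example, keys, prefix=""):
    """Hash the values stored under the given keys, optionally seeded with a prefix."""
    parts = [
        f"{key}={json.dumps(example[key], sort_keys=True)}"
        for key in keys
        if key in example
    ]
    payload = prefix + "|" + "|".join(parts)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def toy_set_hashes(example, input_keys=DEFAULT_INPUT_KEYS, task_keys=("options",)):
    """Return a copy of the example with toy input and task hashes attached."""
    input_hash = toy_hash(example, input_keys)
    # The task hash is seeded with the input hash, so a different input
    # leads to a different task hash even when the task keys match.
    task_hash = toy_hash(example, task_keys, prefix=input_hash)
    return {**example, "_input_hash": input_hash, "_task_hash": task_hash}


options = [{"id": "a", "text": "A"}, {"id": "b", "text": "B"}]
eg1 = {"html": "aaaaa", "options": options}
eg2 = {"html": "bbbbb", "options": options}

h1 = toy_set_hashes(eg1)
h2 = toy_set_hashes(eg2)
print(h1["_input_hash"] != h2["_input_hash"])  # True: the html values differ
print(h1["_task_hash"] != h2["_task_hash"])    # True: the task hash inherits the difference
```

In this sketch, identical options are not enough to produce identical task hashes, because the differing html value has already changed the input hash.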

Another thing to consider: Prodigy expects every task to have at least one key that can be used to determine whether two inputs match. So if you really wanted to consider both examples in your snippet identical (even though they contain different content that's likely shown to the annotator, and no other matching content), you'd have to set input_keys=("options",) and task_keys=("options",).
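In code, that suggestion might look roughly like the following. This is a sketch that assumes Prodigy is installed and that set_hashes accepts input_keys, task_keys and overwrite as keyword arguments. overwrite=True is used on the assumption that the examples may already carry hashes from an earlier call, and the helper name hash_on_options_only is made up for illustration:

```python
import copy


def hash_on_options_only(example):
    """Hash an example so that only its 'options' determine both hashes."""
    # Imported lazily so the helper can be defined without Prodigy installed.
    from prodigy.util import set_hashes

    # Work on a copy so the caller's dict is left untouched.
    return set_hashes(
        copy.deepcopy(example),
        input_keys=("options",),
        task_keys=("options",),
        overwrite=True,
    )


# Usage (requires Prodigy), with eg1 and eg2 defined as in the question:
# hash_on_options_only(eg1)["_task_hash"] == hash_on_options_only(eg2)["_task_hash"]
```

With these settings the html content no longer contributes to either hash, so both examples would be treated as the same question about the same input.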