set_hashes unexpected behaviour

I'm currently trying to understand the behaviour of set_hashes in prodigy.util.

I have two examples:

from prodigy.util import set_hashes

eg1 = {'html': 'aaaaa', 'options': [{'id': 'a', 'text': 'A'}, {'id': 'b', 'text': 'B'}]}
eg2 = {'html': 'bbbbb', 'options': [{'id': 'a', 'text': 'A'}, {'id': 'b', 'text': 'B'}]}

I would expect both task hashes to be equal in this case:

task_hash1 = set_hashes(eg1, task_keys=('options',))['_task_hash']  # 1438234412
task_hash2 = set_hashes(eg2, task_keys=('options',))['_task_hash']  # -158612798

However, this is not the case...

Am I missing something here?

Hi! The task hash is based on the input hash (representing the original input) and the task keys (representing the question asked about the input). The html key is considered for the input hash, and because its value differs between your two examples, the input hashes differ and you end up with different task hashes as well. This is typically the desired result, because you do want two different texts with the same options to receive different task hashes and be considered different.

Another thing to consider: Prodigy expects to always have at least one key that's available in the task and that can be used to determine whether two inputs match. So if you really wanted to consider both examples in your snippet identical (even though they contain different content that's likely shown to the annotator and no other matching content), you'd have to set input_keys=("options",) and task_keys=("options",).
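To make the relationship between the two hashes concrete, here is a toy model in plain Python. It is not Prodigy's implementation: toy_hash, toy_set_hashes and the default key tuples are hypothetical stand-ins, and the only thing it mirrors is the structure described above (input hash from the input keys, task hash from the input hash plus the task keys).

```python
import hashlib
import json


def toy_hash(value):
    """Stable integer hash of a JSON-serialisable value (stand-in, not Prodigy's hash)."""
    dumped = json.dumps(value, sort_keys=True).encode("utf8")
    return int.from_bytes(hashlib.sha256(dumped).digest()[:8], "big")


def toy_set_hashes(eg, input_keys=("text", "html"), task_keys=("options",)):
    # Assumed defaults for illustration only; check the real defaults in the docs.
    # Input hash: built only from the input keys present in the example.
    input_part = {key: eg[key] for key in input_keys if key in eg}
    input_hash = toy_hash(input_part)
    # Task hash: built from the input hash plus the task keys.
    task_part = {key: eg[key] for key in task_keys if key in eg}
    task_hash = toy_hash([input_hash, task_part])
    return {**eg, "_input_hash": input_hash, "_task_hash": task_hash}


options = [{"id": "a", "text": "A"}, {"id": "b", "text": "B"}]
eg1 = {"html": "aaaaa", "options": options}
eg2 = {"html": "bbbbb", "options": options}

# html feeds the input hash, so the task hashes differ as well.
default1 = toy_set_hashes(eg1)
default2 = toy_set_hashes(eg2)
print(default1["_task_hash"] == default2["_task_hash"])  # False

# Basing both hashes on "options" only makes the two examples identical.
same1 = toy_set_hashes(eg1, input_keys=("options",), task_keys=("options",))
same2 = toy_set_hashes(eg2, input_keys=("options",), task_keys=("options",))
print(same1["_task_hash"] == same2["_task_hash"])  # True
```

The same pattern should carry over to the real set_hashes call by passing input_keys=("options",) and task_keys=("options",).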

Ah I see, that makes more sense!

Thanks for the response.

I do have another related question that you might be able to help me with.

In my situation we have HTML formatted in a certain way to just display the results to the user. Of course we would have to make sure that small changes in the formatting do not change the corresponding ids.

I currently put the ids of the text examples in the 'meta' fields of the json eg:

example = {
    "html": "some html", 
    "meta": {
       "id_1": 12345, 
       "id_2": 12345, 
    }
}

I would like to base the _task_hash and _input_hash on the ids in the meta field. According to the documentation for set_hashes, ignore_keys seems to work on nested fields, but this doesn't seem to be the case for input_keys and task_keys.

I was thinking of instead manually putting in the _input_hash instead of using the set_hashes method but then I would also have to manually find a way to compute the _task_hash...

Can you maybe recommend a better way of doing this or inform me whether I am thinking about the hashes in a wrong way?

Appreciate it!

Roberto

I think generating the input hash yourself makes sense here, especially since you know exactly what you want to base it on. You could even just use the IDs as the hashes directly (the values should be integers, but otherwise Prodigy doesn't care how they look).

About the _task_hash: Does your use case have a distinction between inputs and questions about the same inputs? For example, the same text with different suggestions or labels (which would receive the same input hashes but different task hashes)? If not, you can just use the same value for the task hash and input hash. Or you could concatenate the input hash and some other ID to create the task hash.
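A minimal sketch of that idea, assuming the IDs live under meta as in your example. The helper names stable_int_hash and add_id_hashes and the question_id parameter are made up for illustration and are not part of Prodigy; the only requirement taken from above is that the hashes are integers.

```python
import hashlib


def stable_int_hash(*parts):
    """Map the parts to a 32-bit integer that stays the same across runs."""
    joined = "|".join(str(part) for part in parts)
    digest = hashlib.sha256(joined.encode("utf8")).digest()
    return int.from_bytes(digest[:4], "big")


def add_id_hashes(eg, question_id=None):
    """Set _input_hash and _task_hash from the meta IDs instead of the html."""
    meta = eg["meta"]
    input_hash = stable_int_hash(meta["id_1"], meta["id_2"])
    if question_id is None:
        # No distinction between inputs and questions: reuse the input hash.
        task_hash = input_hash
    else:
        # Combine the input hash with another ID to get a distinct task hash.
        task_hash = stable_int_hash(input_hash, question_id)
    return {**eg, "_input_hash": input_hash, "_task_hash": task_hash}


# The html formatting differs, but the IDs (and therefore the hashes) do not.
a = add_id_hashes({"html": "some html", "meta": {"id_1": 12345, "id_2": 12345}})
b = add_id_hashes({"html": "<b>some html</b>", "meta": {"id_1": 12345, "id_2": 12345}})
print(a["_input_hash"] == b["_input_hash"])  # True
print(a["_task_hash"] == b["_task_hash"])  # True
```

Passing a different question_id for each question about the same input keeps the input hash fixed while producing separate task hashes.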

Btw, alternatively, you could also use "meta" as the input/task key on set_hashes. This will base the hashes off the dumped JSON of that field.