How does "set_hashes" work?


I'm having some trouble with set_hashes. Sometimes it gives the same _input_hash although the input is different.
I ran some tests, but there are some results I don't understand. For example:

set_hashes({"path":"/home/user/audio.wav"}, input_keys=("path"))


set_hashes({"path":"/home/user/audio2.wav"}, input_keys=("path"))

give the same result:

'_input_hash': -1979175224, '_task_hash': -1952772507

But it gives different results if I add a new key to the task. For example, this:

set_hashes({"path":"/home/user/audio.wav", "text" : "test"}, input_keys=("path")) 

doesn't give the same result. And if I add ignore=("text"), I don't get the previous result back.
I don't know how to control what set_hashes uses to create the _input_hash...

Thank you,

Hi Jim. You have stumbled upon a key name that's in the default ignore list, and I can totally imagine the confusion. Let's go over it step by step.

If the key is just text, nothing unexpected is happening.

from prodigy import set_hashes 

paths = ["a", "b", "c"]

[set_hashes({"text": p}) for p in paths]
# [{'text': 'a', '_input_hash': -1808989213, '_task_hash': -1053809049},
#  {'text': 'b', '_input_hash': 748838038, '_task_hash': -150095526},
#  {'text': 'c', '_input_hash': -600324218, '_task_hash': 1805703639}]

Same with the key "paths". Note the extra comma, by the way: I want input_keys to be a tuple, and ("paths") without the trailing comma is just a string!

[set_hashes({"paths": p}, input_keys=("paths", )) for p in paths]
# [{'paths': 'a', '_input_hash': -895075480, '_task_hash': 652594778},
#  {'paths': 'b', '_input_hash': -501196447, '_task_hash': 144266571},
#  {'paths': 'c', '_input_hash': -451190452, '_task_hash': 884629090}]
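
On the tuple point: in Python the comma makes the tuple, not the parentheses. Here is a quick check that doesn't need Prodigy at all (the variable names are just for illustration):

```python
# Parentheses around a single value do not create a tuple; the trailing comma does.
not_a_tuple = ("path")
one_tuple = ("path",)

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple

# Iterating the plain string yields single characters, not the key name.
print(list(not_a_tuple))  # ['p', 'a', 't', 'h']
print(list(one_tuple))    # ['path']
```

So input_keys=("path") and ignore=("text") from the question both pass plain strings rather than one-element tuples.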

But once I call the key path, every example gets the same hashes.

[set_hashes({"path": p}) for p in paths]
# [{'path': 'a', '_input_hash': -1979175224, '_task_hash': -1952772507},
#  {'path': 'b', '_input_hash': -1979175224, '_task_hash': -1952772507},
#  {'path': 'c', '_input_hash': -1979175224, '_task_hash': -1952772507}]

That's because of the ignore parameter (docs). This has the following defaults (copied from source code):

IGNORE_HASH_KEYS = ("score", "rank", "model", "source", "pattern", "priority", "path", VIEW_ID_ATTR, SESSION_ID_ATTR, ANNOTATOR_ID_ATTR, "answer")

Notice that path is in there.
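
To make the effect concrete, here is a toy sketch of the idea. This is not Prodigy's actual implementation: toy_input_hash and TOY_IGNORE are hypothetical names, and the hashing scheme is something I made up for illustration. It only shows why ignoring a key makes different values hash the same.

```python
import hashlib
import json

# Assumption: an illustrative subset of the default ignore list quoted above.
TOY_IGNORE = ("score", "rank", "model", "source", "pattern", "priority", "path", "answer")

def toy_input_hash(task, input_keys=("text",), ignore=TOY_IGNORE):
    """Hypothetical stand-in for an input hash: only the values of
    input_keys that are present in the task and not ignored get hashed."""
    kept = {k: task[k] for k in input_keys if k in task and k not in ignore}
    payload = json.dumps(kept, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

a = toy_input_hash({"path": "audio.wav"}, input_keys=("path",))
b = toy_input_hash({"path": "audio2.wav"}, input_keys=("path",))
print(a == b)  # True: "path" is ignored, so both calls hash an empty payload

c = toy_input_hash({"path": "audio.wav"}, input_keys=("path",), ignore=())
d = toy_input_hash({"path": "audio2.wav"}, input_keys=("path",), ignore=())
print(c == d)  # False: nothing is ignored, so the differing values are hashed
```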

We can fix it by passing an empty ignore list:

[set_hashes({"path": p}, ignore=[]) for p in paths]
# [{'path': 'a', '_input_hash': -1378317001, '_task_hash': -1022972960},
#  {'path': 'b', '_input_hash': -730857728, '_task_hash': -945938464},
#  {'path': 'c', '_input_hash': -82246316, '_task_hash': -1983325868}]
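
Keep in mind that ignore=[] drops every default, not just path. If you'd rather keep ignoring the other keys, one option is to rebuild the tuple without "path". A sketch only: the three *_ATTR values below are my assumption of what those constants hold, so please check them against your installed Prodigy version, and DEFAULT_IGNORE / custom_ignore are just names I picked here.

```python
# Assumed values for the constants referenced in the source snippet above;
# verify them against your Prodigy version before relying on them.
VIEW_ID_ATTR = "_view_id"
SESSION_ID_ATTR = "_session_id"
ANNOTATOR_ID_ATTR = "_annotator_id"

DEFAULT_IGNORE = (
    "score", "rank", "model", "source", "pattern", "priority", "path",
    VIEW_ID_ATTR, SESSION_ID_ATTR, ANNOTATOR_ID_ATTR, "answer",
)

# Keep every default except "path".
custom_ignore = tuple(key for key in DEFAULT_IGNORE if key != "path")
print(custom_ignore)

# Then pass it along, for example:
# set_hashes({"path": "/home/user/audio.wav"}, input_keys=("path",), ignore=custom_ignore)
```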

Oh, thank you for this very quick response :)
That was indeed the problem. The docs don't explicitly list "path", and since I didn't look at the source code, I missed it...
Thanks again!