Inconsistent hashing

Hi, I recently came across an issue with input hashes while processing my data.

The code below should help illustrate my issue:

>>> import json
>>> from prodigy import set_hashes
>>> eg1 = json.loads('''{"text":"abc"}''')
>>> eg2 = json.loads('''{"text":"abc", "image":"abc.jpg"}''')
>>>
>>> hash1 = set_hashes(eg1, input_keys=("text"), overwrite=True)
>>> hash2 = set_hashes(eg2, input_keys=("text"), overwrite=True)
>>>
>>> hash1['_input_hash']
-1831620795
>>> hash2['_input_hash']
-1420417407

As I understand it, the input_keys parameter of set_hashes indicates which key(s) of the data to use when computing the input hash.

For both examples, I provided data with the same text and set input_keys to target only "text".
However, the presence of the image key seems to change the hash somehow. Shouldn't both hashes be the same?

P.S.: Here is how this issue affects me

I annotated 500 tasks manually (using a slightly customized ner.manual interface).

Then I trained a model on those 500 examples and set out to annotate 500 more using ner.correct.
To do so, I ran db-out on dataset_500, ran db-in to create dataset_1000, and finally ran ner.correct with dataset_1000 and 1-1000.jsonl as the input.

I thought the first 500 examples wouldn't be repeated, but the input hashes differ because the file imported into dataset_1000 contains an "image" key while 1-1000.jsonl does not. This leads to Prodigy treating them as different inputs and repeating the first 500 examples.
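For reference, the workflow looked roughly like this (the model path and label are placeholders, and db-out may take an output directory instead of stdout depending on the Prodigy version):

prodigy db-out dataset_500 > dataset_500.jsonl
prodigy db-in dataset_1000 dataset_500.jsonl
prodigy ner.correct dataset_1000 ./my-model ./1-1000.jsonl --label MY_LABEL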

I think there's a very subtle typo in the example that ends up causing the behaviour you're seeing: the input_keys argument takes a list or tuple of keys, but ("text") is actually just the string "text" – so you either want a tuple ("text",) or a list ["text"].
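A quick way to see this in the REPL: parentheses alone don't create a tuple in Python, the trailing comma does.

>>> type(("text"))
<class 'str'>
>>> type(("text",))
<class 'tuple'>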

If the provided keys are None, not valid, or not found in the current JSON task, Prodigy falls back to hashing the whole dumped JSON, so you never end up with all tasks accidentally receiving the exact same hash. That's what's happening here and why the image key has an impact after all.
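As a rough sketch of why the fallback produces different hashes (this uses md5 for illustration and is not Prodigy's actual hashing implementation):

import hashlib
import json

def whole_task_hash(task):
    # Fallback idea: hash the entire dumped task rather than selected keys
    dumped = json.dumps(task, sort_keys=True)
    return hashlib.md5(dumped.encode("utf8")).hexdigest()

eg1 = {"text": "abc"}
eg2 = {"text": "abc", "image": "abc.jpg"}
print(whole_task_hash(eg1) == whole_task_hash(eg2))  # False: the extra key changes the dump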

If I change the example to the following, it produces the expected result (-1209601387 for both hashes):

hash1 = set_hashes(eg1, input_keys=("text",), overwrite=True)
hash2 = set_hashes(eg2, input_keys=("text",), overwrite=True)

Alright, so I went ahead and separated out the duplicates into another file (and also fixed their input hashes) using the following script:

import json
from prodigy import set_hashes

TASKS1 = {}  # first occurrence of each file_num
TASKS2 = {}  # duplicate occurrence of each file_num

with open('./dataset_1_1000.jsonl') as f:
    for line in f:
        task = json.loads(line.rstrip('\n'))
        # Recompute the input hash from the text only
        task = set_hashes(task, input_keys=('text',), overwrite=True)

        if task["meta"]["file_num"] not in TASKS1:
            TASKS1[task["meta"]["file_num"]] = task
        else:
            TASKS2[task["meta"]["file_num"]] = task

with open('./results/result_1_1000_part1.jsonl', 'w') as f:
    f.write('\n'.join(json.dumps(tv) for tv in TASKS1.values()))

with open('./results/result_1_1000_part2.jsonl', 'w') as f:
    f.write('\n'.join(json.dumps(tv) for tv in TASKS2.values()))

Since I want to combine the two, I think the review recipe is the one I should use. However, I see that review accepts only datasets as input.

Is there a way to use the .jsonl files directly with review? Or do I have to db-in the files and then use the new datasets?
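In case it helps to show what I mean, I assume the dataset-based workflow would look something like this (dataset names are placeholders):

prodigy db-in part_1 ./results/result_1_1000_part1.jsonl
prodigy db-in part_2 ./results/result_1_1000_part2.jsonl
prodigy review reviewed_dataset part_1,part_2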