function set_hashes

Vincent · October 4, 2022, 9:12pm

Hello,

I am facing compatibility issues between the prodigy package and another processing project. I use prodigy for NER tagging and would like to use only the set_hashes function in the other project. Could you describe how this function hashes text, so I can recalculate the text hashes in the other project without building the prodigy package?

Thank you,

Vincent

ryanwesslen · October 5, 2022, 4:54pm

Hi @Vincent!

set_hashes calls get_hash (from prodigy.util import get_hash), which uses murmurhash to compute the _input_hash from the values of whichever of the keys "text", "image", "html", "input" are present in the task.

The _task_hash is computed in a similar way, but from the values of the keys "spans", "label", "options", "arcs", with the input hash prepended as a prefix before the murmurhash hash is taken.

Also, it's important to be aware of a small quirk if you have any keys named in the ignore list. See this post for more details:

Hopefully this helps!

Vincent · October 26, 2022, 4:15pm

Thanks for your answer. I tried to install murmurhash but still could not get the same results.

version prodigy==1.10.7
version murmurhash==1.0.9

import murmurhash
murmurhash.hash("foo")
>> -156908512
#VERSUS
from prodigy.util import set_hashes
set_hashes({"text":"foo"})
>> {'text': 'foo', '_input_hash': 1503402072, '_task_hash': -1940932841}

Would you mind sharing your murmurhash hashing?

Thank you in advance for your help.

Vincent

koaning · October 27, 2022, 1:47pm

There's some extra stuff happening in the set_hashes function. In particular, it uses the get_hash function which internally also encodes an optional prefix as well as the key of the dictionary.

import murmurhash
from prodigy.util import get_hash

task = {"text": "hello"}
keys = ("text",)
prefix = ""

# Hash only the "text" key, with no prefix and no ignored keys
prodigy_hash = get_hash(task, keys=keys, prefix=prefix, ignore=tuple())
# 1832607575

But here's what we send into murmurhash inside that function:

murmurhash.hash('text="hello"')
# 1832607575
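For illustration, the string above could be assembled without Prodigy roughly as follows. This is only a sketch: hash_input_string is a hypothetical name, and it assumes that the standard library's json.dumps with compact separators matches the JSON encoding Prodigy uses internally, which is not confirmed in this thread. It also skips the ignore-list handling.

```python
import json


def hash_input_string(task, keys, prefix=""):
    """Build the string that would then be passed to murmurhash.hash.

    Hypothetical helper: assumes compact, key-sorted JSON matches
    Prodigy's internal encoding, and does not filter ignored keys.
    """
    parts = [
        f"{key}={json.dumps(task[key], separators=(',', ':'), sort_keys=True)}"
        for key in keys
        if key in task
    ]
    return str(prefix) + "".join(parts)


# Only keys that are present in the task contribute to the string.
print(hash_input_string({"text": "hello"}, ("text", "image", "html", "input")))
# text="hello"
```

Passing the resulting string to murmurhash.hash should then reproduce the value above, provided the encoding assumption holds for your data.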

In general, though, it's much safer to use the get_hash function directly instead of building your own. There may be hashing edge cases in the future that we abstract away inside get_hash. The function is public, but its implementation could change.

Vincent · October 28, 2022, 8:07am

Thank you very much for your answer. This is very helpful. Could I ask the same about the _task_hash calculation using the murmurhash library? Then I will be ready to move forward!

Have a good day.

Vincent

koaning · October 31, 2022, 3:12pm

The task hashes are also set with the get_hash function internally, with the input hash passed in as the prefix. The values are serialized to JSON with their keys sorted before hashing.
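As a rough sketch of what that means for the string being hashed: the input hash goes in front, followed by the key-sorted JSON of each task key that is present. The name task_hash_string is hypothetical, the key list is the one mentioned earlier in this thread, and the use of json.dumps with compact separators is an assumption about the encoding. The case where none of the task keys is present is not covered here.

```python
import json

# Task keys as named earlier in this thread; may differ between versions.
TASK_KEYS = ("spans", "label", "options", "arcs")


def task_hash_string(task, input_hash, keys=TASK_KEYS):
    """Build the string for the task hash (hypothetical helper).

    Assumes compact, key-sorted JSON; does not filter ignored keys and
    does not handle tasks that contain none of the task keys.
    """
    parts = []
    for key in keys:
        if key in task:
            dumped = json.dumps(task[key], separators=(",", ":"), sort_keys=True)
            parts.append(f"{key}={dumped}")
    # The input hash is prepended as a plain string prefix.
    return str(input_hash) + "".join(parts)


example = {"text": "foo", "spans": [{"start": 0, "end": 3, "label": "X"}]}
# 1503402072 is the _input_hash reported above for {"text": "foo"}.
print(task_hash_string(example, 1503402072))
# 1503402072spans=[{"end":3,"label":"X","start":0}]
```

Note how the keys inside each span come out in alphabetical order, regardless of the order in which they were written.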

Is there a reason why you can't use the get_hash function and must use murmurhash directly?

Vincent · November 3, 2022, 9:12am

Hi Koaning,

I have a side project that computes statistics on the database, and its dependencies are not compatible with the prodigy library. The only functions from prodigy that I need in that project are get_hash or set_hashes. That's why I am asking what's behind these functions.

koaning · November 4, 2022, 12:53pm

Here's the implementation that we currently use. Note! This implementation might change in the future, but it is what's used right now.

import murmurhash
import srsly
from prodigy.util import IGNORE_HASH_KEYS

# Note: _filter_keys is an internal Prodigy helper that is not shown in this post.

def get_hash(
    task,
    keys,
    prefix="",
    *,
    ignore=IGNORE_HASH_KEYS,
):
    """Get hash for a task based on task keys.
    task (dict): The task to hash.
    keys (tuple): The keys to include.
    prefix (str): Optional prefix to add. For task hashes, the input hash
        is used as the prefix.
    ignore (list): Optional list of keys to ignore. The hash will be insensitive
        to the presence and values of the ignored keys, anywhere in the object.
        Defaults to ('score', 'rank', 'model').
    RETURNS (int): The hash.
    """
    # Drop ignored keys so they cannot influence the hash
    task = _filter_keys(task, set(ignore))
    # One "key=<json>" chunk per requested key that is present in the task
    values = [
        f"{key}={srsly.json_dumps(task[key], sort_keys=True)}"
        for key in keys
        if key in task
    ]
    values = "".join(values)
    # If none of the keys is present, fall back to hashing the whole task
    if not values:
        values = srsly.json_dumps(task, sort_keys=True)
    values = str(prefix) + values
    hash_ = murmurhash.hash(values)
    return hash_
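The _filter_keys helper is not included in the snippet above. Going only by the docstring ("insensitive to the presence and values of the ignored keys, anywhere in the object"), a stand-in might look like the sketch below. This is a guess, not Prodigy's actual code; filter_keys is a hypothetical name and the real helper may treat some types differently.

```python
def filter_keys(obj, ignore):
    """Recursively drop ignored keys from dicts nested anywhere in obj.

    Hypothetical stand-in based on the docstring, not Prodigy's own helper.
    """
    if isinstance(obj, dict):
        return {
            key: filter_keys(value, ignore)
            for key, value in obj.items()
            if key not in ignore
        }
    if isinstance(obj, (list, tuple)):
        # Tuples become lists, which serialize to the same JSON array.
        return [filter_keys(item, ignore) for item in obj]
    return obj


task = {"text": "a", "score": 0.5, "spans": [{"start": 0, "score": 0.1}]}
print(filter_keys(task, {"score", "rank", "model"}))
# {'text': 'a', 'spans': [{'start': 0}]}
```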
