function set_hashes

Hello,

I am facing compatibility issues between the prodigy package and another processing project. I use prodigy for NER tagging and would like to use only the set_hashes function in the other project. Could you describe how you hash text in this function so I can recalculate text hashes in the other project without building the prodigy package?

Thank you,

Vincent

Hi @Vincent!

set_hashes calls get_hash (from prodigy.util import get_hash), which computes the _input_hash with murmurhash from the values of the keys "text", "image", "html" and "input".

The _task_hash is computed in a similar way: the values of the keys "spans", "label", "options" and "arcs" are hashed with murmurhash, with the _input_hash added as a prefix.

Also, it's important to be aware of a small quirk if you have any keys named in the ignore list. See this post for more details:

Hopefully this helps!

Thanks for your answer. I tried to install murmurhash but still could not get the same results.

version prodigy==1.10.7
version murmurhash==1.0.9

import murmurhash
murmurhash.hash("foo")
>> -156908512
#VERSUS
from prodigy.util import set_hashes
set_hashes({"text":"foo"})
>> {'text': 'foo', '_input_hash': 1503402072, '_task_hash': -1940932841}

Would you mind sharing your murmurhash hashing?

Thank you in advance for your help.

Vincent

There's some extra stuff happening in the set_hashes function. In particular, it uses the get_hash function which internally also encodes an optional prefix as well as the key of the dictionary.

import murmurhash
from prodigy.util import get_hash

task = {"text": "hello"}
keys = ("text",)
prefix = ""

prodigy_hash = get_hash(task, keys=keys, prefix=prefix, ignore=tuple())
# 1832607575

But here's what we send into murmurhash inside that function:

murmurhash.hash('text="hello"')
# 1832607575
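To make the difference from hashing the raw text explicit, here is a minimal sketch of how the string for a single key could be assembled. It uses the standard library's json module as a stand-in for srsly.json_dumps, and hash_input_for_key is a hypothetical helper name, not part of Prodigy. The compact separators are an assumption that only matters for nested values.

```python
import json


def hash_input_for_key(key, value, prefix=""):
    """Build the 'key=<json value>' string for one key (hypothetical helper)."""
    # Assumption: json.dumps with compact separators matches srsly.json_dumps
    # for simple values; for plain strings the separators make no difference.
    encoded = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return f"{prefix}{key}={encoded}"


print(hash_input_for_key("text", "hello"))  # text="hello"
print(hash_input_for_key("text", "foo"))    # text="foo"
```

So for the earlier {"text": "foo"} example, the string to pass to murmurhash.hash would be text="foo" rather than foo, which would explain why the two results did not match.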

In general though, it's much safer to use the get_hash function directly instead of building your own thing. There may be hashing edge cases that we later abstract away inside get_hash. The function is public, but its implementation could theoretically change.

Thank you very much for your answer. This is very helpful. Could I ask the same about how _task_hash is calculated with the murmurhash library? Then I will be ready to move forward!

Have a good day.

Vincent

The task hashes are also set with the get_hash function internally. The string is sorted by key beforehand.

Is there a reason why you can't use the get_hash function and must use murmurhash directly?

Hi Koaning,

I have a side project for statistics in the database using dependencies that are not compatible with prodigy lib. The only function from prodigy lib that I need in the project is get_hash or set_hashes. That's why I am asking what's behind these functions :slight_smile:

Here's the implementation that we currently use. Note! This implementation might change in the future, but it is what's used right now.

import murmurhash
import srsly
from prodigy.util import IGNORE_HASH_KEYS

def get_hash(
    task,
    keys,
    prefix= "",
    *,
    ignore=IGNORE_HASH_KEYS,
):
    """Get hash for a task based on task keys.
    task (dict): The task to hash.
    keys (tuple): The keys to include.
    prefix (str): Optional prefix to add. For task hashes, the input hash
        is used as the prefix.
    ignore (list): Optional list of keys to ignore. The hash will be insensitive
        to the presence and values of the ignored keys, anywhere in the object.
        Defaults to ('score', 'rank', 'model').
    RETURNS (int): The hash.
    """
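    # _filter_keys is an internal Prodigy helper (not shown here) that removes
    # the ignored keys from the task before hashing.
    task = _filter_keys(task, set(ignore))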
    values = [
        f"{key}={srsly.json_dumps(task[key], sort_keys=True)}"
        for key in keys
        if key in task
    ]
    values = "".join(values)
    if not values:
        values = srsly.json_dumps(task, sort_keys=True)
    values = str(prefix) + values
    hash_ = murmurhash.hash(values)
    return hash_
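
For a project that cannot install prodigy at all, the logic above could be approximated along these lines. This is only a sketch under several assumptions: filter_keys is a hypothetical stand-in for the private _filter_keys helper, written from the docstring's description; the key tuples are taken from the descriptions earlier in this thread; and the standard library's json.dumps with compact separators replaces srsly.json_dumps, which may differ for floats, non-ASCII text or nested values. The results should be compared against hashes produced by a real Prodigy install before relying on them.

```python
import json

try:
    import murmurhash  # third-party: pip install murmurhash
except ImportError:  # the string-building part still works without it
    murmurhash = None

# Assumptions taken from this thread, not from Prodigy's source code.
INPUT_KEYS = ("text", "image", "html", "input")
TASK_KEYS = ("spans", "label", "options", "arcs")
IGNORE_KEYS = ("score", "rank", "model")


def filter_keys(obj, ignore):
    """Hypothetical stand-in for _filter_keys: recursively drop ignored keys."""
    if isinstance(obj, dict):
        return {k: filter_keys(v, ignore) for k, v in obj.items() if k not in ignore}
    if isinstance(obj, (list, tuple)):
        return [filter_keys(v, ignore) for v in obj]
    return obj


def dumps(value):
    # Stand-in for srsly.json_dumps(value, sort_keys=True); the compact
    # separators are an assumption to verify against real Prodigy hashes.
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


def build_hash_string(task, keys, prefix="", ignore=IGNORE_KEYS):
    """Build the string that would be passed to murmurhash.hash."""
    task = filter_keys(task, set(ignore))
    values = "".join(f"{key}={dumps(task[key])}" for key in keys if key in task)
    if not values:
        # None of the requested keys is present: fall back to the whole task.
        values = dumps(task)
    return str(prefix) + values


def get_hash(task, keys, prefix="", ignore=IGNORE_KEYS):
    """Hash the built string; requires the murmurhash package."""
    if murmurhash is None:
        raise RuntimeError("murmurhash is not installed")
    return murmurhash.hash(build_hash_string(task, keys, prefix, ignore))


task = {"text": "hello", "label": "PERSON", "score": 0.5}
print(build_hash_string(task, INPUT_KEYS))             # text="hello"
print(build_hash_string(task, TASK_KEYS, prefix=123))  # 123label="PERSON"
```

In real use the prefix for the task hash would be the input hash returned by get_hash(task, INPUT_KEYS); the value 123 above is only a placeholder to show where the prefix ends up in the string.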