I am facing compatibility issues between the prodigy package and another processing project. I use Prodigy for NER tagging, and the only thing I need in the other project is the set_hashes function. Could you describe how text is hashed in that function, so I can recalculate text hashes in the other project without installing the prodigy package?
But here's what we send into murmurhash inside of that function:
In general, though, it's much safer to use the get_hash function directly instead of building your own. There may be hashing edge cases in the future that we abstract away inside get_hash. The function is public, but the implementation could theoretically change over time.
I have a side project that computes statistics in the database, and it uses dependencies that are not compatible with the prodigy lib. The only function from prodigy that I need in that project is get_hash or set_hashes, which is why I'm asking what's behind these functions.
Here's the implementation that we currently use. Note: this implementation might change in the future, but it is what's used right now.
```python
import murmurhash
import srsly

from prodigy.util import IGNORE_HASH_KEYS


def get_hash(task, keys, prefix="", ignore=IGNORE_HASH_KEYS):
    """Get hash for a task based on task keys.

    task (dict): The task to hash.
    keys (tuple): The keys to include.
    prefix (str): Optional prefix to add. For task hashes, the input hash
        is used as the prefix.
    ignore (list): Optional list of keys to ignore. The hash will be insensitive
        to the presence and values of the ignored keys, anywhere in the object.
        Defaults to ('score', 'rank', 'model').
    RETURNS (str): The hash.
    """
    # _filter_keys is a private Prodigy helper that recursively drops the
    # ignored keys anywhere in the object.
    task = _filter_keys(task, set(ignore))
    # Collect the stringified values of the requested keys. (The exact
    # per-value expression was cut off in this snippet; str(task[key]) is
    # the likely shape.)
    values = [str(task[key]) for key in keys if key in task]
    values = "".join(values)
    if not values:
        # None of the requested keys were present: fall back to hashing
        # the JSON dump of the whole (filtered) task.
        values = srsly.json_dumps(task, sort_keys=True)
    values = str(prefix) + values
    hash_ = murmurhash.hash(values)
    return hash_
```
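If you want to reproduce the same scheme without installing Prodigy, a minimal sketch could look like the following. Be aware of the assumptions here: `IGNORE_HASH_KEYS` and `_filter_keys` are reimplemented from the docstring above rather than copied from Prodigy's source, `json.dumps` with compact separators stands in for `srsly.json_dumps`, and `zlib.crc32` stands in for `murmurhash.hash` so the sketch runs without third-party packages. To get hash values that actually match Prodigy's, you would swap `murmurhash.hash` back in.

```python
import json
import zlib

# Default ignored keys, per the docstring above.
IGNORE_HASH_KEYS = ("score", "rank", "model")


def _filter_keys(obj, ignore):
    """Recursively drop ignored keys anywhere in the object (assumed behavior)."""
    if isinstance(obj, dict):
        return {k: _filter_keys(v, ignore) for k, v in obj.items() if k not in ignore}
    if isinstance(obj, list):
        return [_filter_keys(v, ignore) for v in obj]
    return obj


def get_hash(task, keys, prefix="", ignore=IGNORE_HASH_KEYS, hash_fn=None):
    """Sketch of Prodigy-style task hashing with a pluggable hash function."""
    if hash_fn is None:
        # Stand-in for murmurhash.hash; replace with murmurhash.hash to
        # reproduce Prodigy's actual hash values.
        hash_fn = lambda s: zlib.crc32(s.encode("utf8"))
    task = _filter_keys(task, set(ignore))
    values = "".join(str(task[key]) for key in keys if key in task)
    if not values:
        # Compact separators approximate srsly.json_dumps output.
        values = json.dumps(task, sort_keys=True, separators=(",", ":"))
    return hash_fn(str(prefix) + values)
```

Because `score`, `rank`, and `model` are ignored, two tasks that differ only in those keys produce the same hash, which is what makes the hashes stable across model runs.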