set_hashes produces "this was automatically assigned" warning

Hi, I've been using the set_hashes function like so:

set_hashes({"text":"hello world"}) # generates _input_hash and _task_hash

This is passed to the stream:

return {
    "view_id": "spans_manual",
    "dataset": dataset,
    "stream": (set_hashes(obj) for obj in stream),
    "config": {
        "lang": "en",
        "labels": labels,
        "exclude_by": "input",
    },
}

But I get the following warning in the prompt:

⚠ Prodigy automatically assigned an input/task hash because it was
missing. This automatic hashing will be deprecated as of Prodigy v2 because it
can lead to unwanted duplicates in custom recipes if the examples deviate from
the default assumptions. More information can found on the docs:
https://prodi.gy/docs/api-components#set_hashes

Why does it say the hash was automatically assigned when the above code seems to have already addressed the warning?

Hey @mv3 ,

That warning is indeed unexpected, given that you're applying set_hashes to the stream.
I assume you're importing set_hashes from Prodigy, i.e. from prodigy import set_hashes?

Also, could you share how stream is created in your recipe?
Thanks!

I import via:

from prodigy.util import set_hashes # without .util, linter complains

The stream is created from a generator of dicts. I experimented both with passing it to get_stream (from prodigy.components.stream import get_stream) and with using the generator as is:

with msg.loading(f"Loading {selected}"):
    stream = asset.get_texts_meta(num_chunks=5, results_per_query=10, label=asset.material_path)
    stream = asset.get_stream(stream)  # results in a generator of dicts
    stream = get_stream(stream)  # from prodigy.components.stream
    stream = add_tokens(asset.nlp, stream)
    stream = add_options(stream)
    stream = (set_hashes(obj) for obj in stream)

Hi @mv3 ,

The get_stream function from the Stream component returns a Stream object (as opposed to a generator as it used to be). As soon as you start to iterate over a Stream object, it will do the check on the presence of the hash attributes.

In your case, the generator expression pulls each example out of the Stream object before set_hashes gets to process it:

 stream = (set_hashes(obj) for obj in stream)

which triggers the warning.
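
To make the ordering concrete, here is a toy sketch of the effect. It does not use Prodigy at all: CheckingStream and toy_set_hashes are hypothetical stand-ins, and the assumption is only that the real Stream validates each example at the moment it is yielded.

```python
import warnings
import zlib


def toy_hash(text):
    """Deterministic integer hash of a string (illustrative only)."""
    return zlib.crc32(text.encode("utf8"))


class CheckingStream:
    """Hypothetical stand-in for a stream that checks hashes as it yields."""

    def __init__(self, examples):
        self._examples = iter(examples)

    def __iter__(self):
        for eg in self._examples:
            if "_input_hash" not in eg or "_task_hash" not in eg:
                # Assumed behavior: warn, then fill in the hashes on the fly.
                warnings.warn("hash was missing and was assigned automatically")
                eg["_input_hash"] = toy_hash(eg["text"])
                eg["_task_hash"] = toy_hash(eg["text"])
            yield eg


def toy_set_hashes(eg):
    """Hypothetical explicit hashing step; only fills hashes that are absent."""
    eg.setdefault("_input_hash", toy_hash(eg["text"]))
    eg.setdefault("_task_hash", toy_hash(eg["text"]))
    return eg


stream = CheckingStream([{"text": "hello world"}])
# The generator has to pull each example out of CheckingStream first, so the
# check (and the warning) has already happened before toy_set_hashes runs.
wrapped = (toy_set_hashes(eg) for eg in stream)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = list(wrapped)
```

In this sketch the explicit hashing step never gets a chance to act, because the hashes are already present by the time it sees each example.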

The set_hashes function belongs to the older stream-as-generator API and is not really meant to be used with the new Stream. Its job is now done automatically under the hood. We definitely missed this case when designing the hash warning, and I see how it can be confusing.

So, for your particular case, you might as well skip the line:

 stream = (set_hashes(obj) for obj in stream)

as this is what will be done under the hood anyway.
If you do have meaningful hashes (e.g. generated by a function other than the built-in set_hashes), they should be applied before calling get_stream or via the Stream's apply method.
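
As a rough sketch of the "apply your own hashes first" option: stable_hash and add_custom_hashes below are hypothetical helpers, and the hashing scheme (text for the input hash, text plus a "label" field for the task hash) is only an assumption for illustration, not Prodigy's own logic.

```python
import hashlib


def stable_hash(value):
    """Return a deterministic 32-bit integer hash of a string."""
    digest = hashlib.sha1(value.encode("utf8")).digest()
    return int.from_bytes(digest[:4], "big")


def add_custom_hashes(examples):
    """Yield copies of the examples with hand-assigned hashes (hypothetical scheme)."""
    for eg in examples:
        eg = dict(eg)  # copy so the source examples are left untouched
        # Input hash: depends only on the raw text.
        eg["_input_hash"] = stable_hash(eg["text"])
        # Task hash: text plus label, so the same text with a different
        # label counts as a different task.
        eg["_task_hash"] = stable_hash(eg["text"] + "|" + eg.get("label", ""))
        yield eg


source = [
    {"text": "hello world", "label": "A"},
    {"text": "hello world", "label": "B"},
]
hashed = list(add_custom_hashes(source))

# In a recipe, the generator would be handed over before the Stream is built,
# along the lines of:
# stream = get_stream(add_custom_hashes(source))
```

Because the hashes are already present when the Stream first sees the examples, the check described above should have nothing to fill in.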

In fact, the intended use is that all functions that modify the stream task by task are applied via the apply method. For example:

stream = get_stream(my_source)
stream.apply(add_tokens, nlp=nlp, stream=stream)

Thank you @magdaaniol. Based on the warning, I assumed that automatic hashing would be deprecated once Prodigy v2 arrives, hence the need to call set_hashes() separately to be more explicit. It seems I misunderstood it. Thank you!