Hello! Sorry for bombarding you with questions - I just really dug into Prodigy details recently, having been a relatively uneducated user until now. I am creating a custom review recipe and trying to follow the existing review recipe logic. I am missing the piece where the recipe (or stream?) decides whether the task has already been reviewed and exists in the target dataset. Could you please help? I know how to do it (check the hash in the target dataset), but I want to learn best practices - where it normally happens, and whether there's already some logic for it anywhere. Thanks!
Hi @nrodnova and no worries! That's what the forum is for!
The mechanism responsible for excluding already answered questions is, precisely, filtering on the `task_hash`.
It is not done at the recipe level, though.
When the `Controller` is instantiated, it adds the current dataset to the set of datasets that should be used for the filtering (this default behavior can be modified by setting `auto_exclude_current` to `false` in `prodigy.json`).
Then, whenever a session asks for a new batch of questions, the filtering is applied. The actual filtering happens on the `Session` object, because it also needs to take into account the questions answered by that session in the current process. What exactly gets filtered out also depends on the `feed_overlap` setting: if the overlap is set to `true`, only the hashes from the current session are excluded; otherwise, the hashes from all sessions are excluded.
In theory, if you are replicating the hashing behavior of the original `review` recipe, you shouldn't need to implement the exclusion of already reviewed questions yourself.
Just for completeness: whether to filter on the task or the input hash can also be configured via the `exclude_by` config setting. Please see here for more details: Loaders and Input Data · Prodigy · An annotation tool for AI, Machine Learning & NLP
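For reference, the three settings mentioned above all live in `prodigy.json`; a sketch of how they might be combined (the values shown are, as far as I know, the defaults):

```json
{
  "auto_exclude_current": true,
  "feed_overlap": false,
  "exclude_by": "task"
}
```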
Hi @magdaaniol, thank you very much for your response! It makes sense.
One more question about filtering on `task_hash`. Let's assume we are talking about the `ner.manual` interface. If `task_hash` is calculated on `spans` and `answers` only (we don't have `options` in NER), is it really reliable to filter tasks only by span boundaries (assuming answers are mostly `"accept"`)? For short texts and large datasets it's quite likely to get the same `task_hash` for different `input_hash` values, no? What is the reasoning for not including the original `text` (or anything else counted into the `input_hash`)?
Hi @nrodnova,
The `task_hash` function does take into account the `input_hash`. You might have missed this detail in the docs:

> Task keys are based on the input hash, plus optional features you’re annotating, like the label or spans.
A very good observation nonetheless!
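To make that concrete, here's a quick check (with made-up texts) showing that two tasks with identical spans but different texts still end up with different task hashes:

```python
from prodigy import set_hashes

# Same span boundaries and label, different underlying text
a = set_hashes({"text": "Acme hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]})
b = set_hashes({"text": "Ajax hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]})

assert a["_input_hash"] != b["_input_hash"]
assert a["_task_hash"] != b["_task_hash"]  # differs even though the spans are identical
```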
Good point!
I was looking at the description of the `task_keys` parameter of the `set_hashes` function:

> Dictionary keys to consider for the task hash. Defaults to `("spans", "label", "options")`.

So, are `input_keys` + `task_keys` always passed together for the task hash calculation?
Also, one more hashing question. The way I often do `ner` and `spans` annotation is pre-populating the `spans` key with potential candidates (or sentences to look at), and then annotating with a different label (in `ner` mode I'd have to delete the highlighted value and re-select it with the appropriate label; in `spans` it's easier). Because I pass `spans` content in the input, the `_task_hash` is calculated, and it seems to not be recalculated before going to the database. So I am wondering if in your code `overwrite` is `False` and I need to do it in the `before_db` callback myself, or if I am missing something.
Hi @nrodnova,
Yes, to be precise, it passes the already computed `input_hash` as a string. This `input_hash` string is then used as a prefix to the string concatenated from the values of the `task_hash` keys, and the result is passed to the `murmurhash` function.
The core of the function is:
```python
# Simplified version of Prodigy's internal hashing function
from typing import Dict, Iterable

import murmurhash
import srsly

def get_hash(task: Dict, keys: Iterable[str], prefix: str = "") -> int:
    # Serialize each relevant key deterministically (sorted keys), so the
    # same content always produces the same hash
    values = [
        f"{key}={srsly.json_dumps(task[key], sort_keys=True)}"
        for key in keys
        if key in task
    ]
    joined = str(prefix) + "".join(values)  # prefix would be the input_hash
    return murmurhash.hash(joined)
```
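For illustration, calling the sketch above with the default task keys (the example task is made up, and the input keys are simplified to just `text` here):

```python
task = {"text": "Acme hired Bob.", "spans": [{"start": 0, "end": 4, "label": "ORG"}]}
input_hash = get_hash(task, keys=("text",))
task_hash = get_hash(task, keys=("spans", "label", "options"), prefix=str(input_hash))
```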
About the default hashing in the recipes:
By default, the recipe rehashes the input stream (this is done inside the `get_stream` function with `rehash` set to `True`), and it does overwrite the hashes if they already exist.
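In a custom recipe, that would look roughly like this (the exact import path can differ between Prodigy versions):

```python
# Hedged sketch: loading a source with rehashing, as a custom recipe might
from prodigy.components.stream import get_stream

stream = get_stream("./examples.jsonl", rehash=True)  # overwrites any existing hashes
```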
And yes, it doesn't get rehashed before saving to the DB. This is mostly because the hashing logic is relevant for processing the input: when your annotations are used again in any Prodigy recipe, they will be rehashed by default upon reading, so the NER spans will take effect then.
If you're using standard key names, don't need any custom filtering, and will keep using these datasets within Prodigy, there's no need to rehash before writing to the DB.
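That said, if you do want the stored hashes to reflect the final annotations, a minimal sketch of a `before_db` callback using the documented `set_hashes` helper would be:

```python
from typing import Dict, List
from prodigy import set_hashes

def before_db(examples: List[Dict]) -> List[Dict]:
    # Recompute both hashes from the final (possibly edited) spans, so the
    # stored _task_hash reflects the annotation, not the pre-populated input
    return [set_hashes(eg, overwrite=True) for eg in examples]
```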
Thank you! Good to know. I was experimenting with the review recipe, and things didn't work for me because I didn't recalculate the hashes.