Exclude by task hash does not work

Hi Prodigy Team,

We have a custom TextCat recipe in which we don't want the tasks to be rehashed, so we set the rehash parameter of get_stream to False.

    stream = get_stream(
        source, loader=loader, rehash=False, dedup=True, input_key="text"
    )

We disabled rehashing because the hashes are already assigned by another service when it prepares source.jsonl.

{"_input_hash": -1512905942, "_task_hash": -972217336, "task_key": "MULTILABEL_OC_Product / Brand_0", "text": "Excellent!", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_1", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_1_0_10", "ooc": "Product / Brand", "sentence": "Excellent!", "multilabel": true, "oc": "Formula"}}
{"_input_hash": -1778974475, "_task_hash": 1591342871, "task_key": "MULTILABEL_OC_Product / Brand_0", "text": "I love so much .", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2_0_16", "ooc": "Product / Brand", "sentence": "I love so much .", "multilabel": true, "oc": "Formula"}}
{"_input_hash": 139019224, "_task_hash": 1049464764, "task_key": "MULTILABEL_OC_Product / Brand_0", "text": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3_0_110", "ooc": "Product / Brand", "sentence": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "multilabel": true, "oc": "Formula"}}
{"_input_hash": -1512905942, "_task_hash": 75143544, "task_key": "MULTILABEL_OC_Product / Brand_1", "text": "Excellent!", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_1", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_1_0_10", "ooc": "Product / Brand", "sentence": "Excellent!", "multilabel": true, "oc": "Formula"}}
{"_input_hash": -1778974475, "_task_hash": 1758367154, "task_key": "MULTILABEL_OC_Product / Brand_1", "text": "I love so much .", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2_0_16", "ooc": "Product / Brand", "sentence": "I love so much .", "multilabel": true, "oc": "Formula"}}
{"_input_hash": 139019224, "_task_hash": -2001281296, "task_key": "MULTILABEL_OC_Product / Brand_1", "text": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3_0_110", "ooc": "Product / Brand", "sentence": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "multilabel": true, "oc": "Formula"}}
{"_input_hash": 139019224, "_task_hash": 541961746, "task_key": "MULTILABEL_SPAN_1", "text": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_3_0_110", "ooc": "Price", "sentence": "You don't need too much, a little goes a long way and it glides on really smoothly without dragging yourcskin.", "multilabel": true, "oc": "Price"}}
{"_input_hash": -1778974475, "_task_hash": 2130407409, "task_key": "MULTILABEL_OC_Effects_0", "text": "I love so much .", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2_0_16", "ooc": "Effects", "sentence": "I love so much .", "multilabel": true, "oc": "Beauty: Brightening / Spot reduction"}}
{"_input_hash": -1778974475, "_task_hash": 770826728, "task_key": "MULTILABEL_OC_Effects_1", "text": "I love so much .", "meta": {"category": "FACE", "segment": "FACE CARE", "brand": "OLAY", "product": "Olay Regenerist Whip 50ml", "rating": 5, "sentence_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2", "phrase_id": "www.superdrug.com-107254778-8001090875266-Excellent!_2_0_16", "ooc": "Effects", "sentence": "I love so much .", "multilabel": true, "oc": "Beauty: Brightening / Spot reduction"}}

In the dictionary returned by the recipe, we also specify that tasks should be excluded by task hash instead of input hash.

    return {
        "view_id": "choice" if has_options else "classification",
        "dataset": dataset,
        "stream": stream,
        "exclude": exclude,
        "validate_answer": validate_answer,
        "config": {
            "labels": labels,
            "choice_style": "single" if exclusive else "multiple",
            "choice_auto_accept": exclusive,
            "exclude_by": "task",
            "auto_count_stream": True,
        },
    }

However, it seems that the recipe still uses the input hash instead of the task hash to filter out duplicates when we run the Prodigy instance.

We expect all 9 tasks to be shown to the annotators, because each task in source.jsonl has a different task hash. However, only 3 are shown, which matches the number of distinct input hashes. I wonder whether this is a bug. Shouldn't it exclude by task hash, as we have defined in the recipe configuration?
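To make the expectation concrete, here is a minimal plain-Python sketch of the two filtering modes, using the hash pairs from the nine records above. The function filter_duplicates_sketch is a hypothetical stand-in written for this post; it is not Prodigy's implementation, and the assumption is only that a record is dropped when any enabled hash has been seen before.

```python
# Hash pairs copied from the nine records in source.jsonl above.
tasks = [
    {"_input_hash": -1512905942, "_task_hash": -972217336},
    {"_input_hash": -1778974475, "_task_hash": 1591342871},
    {"_input_hash": 139019224, "_task_hash": 1049464764},
    {"_input_hash": -1512905942, "_task_hash": 75143544},
    {"_input_hash": -1778974475, "_task_hash": 1758367154},
    {"_input_hash": 139019224, "_task_hash": -2001281296},
    {"_input_hash": 139019224, "_task_hash": 541961746},
    {"_input_hash": -1778974475, "_task_hash": 2130407409},
    {"_input_hash": -1778974475, "_task_hash": 770826728},
]


def filter_duplicates_sketch(stream, by_input=True, by_task=True):
    """Hypothetical stand-in for a duplicate filter (not Prodigy's code).

    A record is skipped if any enabled hash was already seen.
    """
    seen_input, seen_task = set(), set()
    for eg in stream:
        if by_input and eg["_input_hash"] in seen_input:
            continue
        if by_task and eg["_task_hash"] in seen_task:
            continue
        seen_input.add(eg["_input_hash"])
        seen_task.add(eg["_task_hash"])
        yield eg


both = list(filter_duplicates_sketch(tasks, by_input=True, by_task=True))
task_only = list(filter_duplicates_sketch(tasks, by_input=False, by_task=True))
print(len(both), len(task_only))  # 3 9
```

With both checks enabled only the first three records survive, which is what we observe; with the task hash alone all nine survive, which is what we expect.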

Thank you for helping us out :smiley:

I ran Prodigy with verbose logging and found this output:

14:03:10: PREPROCESS: Add multiple choice options for 3 labels
14:03:10: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x12bab53c0>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x12aee4ed0>>, 'warn_threshold': 0.4}

Could by_input being set to True be the reason for this behavior? If so, how can we set the by_input parameter to False?
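One workaround we are considering, which is only an assumption on our side and not something confirmed for Prodigy, is to pass dedup=False to get_stream and deduplicate the stream ourselves by task hash only. The generator below is a rough sketch; unique_by_task_hash is a hypothetical helper name, and the sample records are shortened versions of the ones above plus one deliberate repeat.

```python
from typing import Dict, Iterable, Iterator


def unique_by_task_hash(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Yield each task once per _task_hash, ignoring _input_hash.

    Hypothetical helper for illustration; not part of Prodigy.
    """
    seen = set()
    for eg in stream:
        task_hash = eg["_task_hash"]
        if task_hash in seen:
            continue  # exact same task already emitted
        seen.add(task_hash)
        yield eg


examples = [
    {"_input_hash": -1512905942, "_task_hash": -972217336, "text": "Excellent!"},
    {"_input_hash": -1512905942, "_task_hash": 75143544, "text": "Excellent!"},
    # Deliberate repeat of the previous record, to show it being dropped.
    {"_input_hash": -1512905942, "_task_hash": 75143544, "text": "Excellent!"},
]

kept = list(unique_by_task_hash(examples))
print([eg["_task_hash"] for eg in kept])
```

In the recipe, the idea would be to wrap the stream returned by get_stream (called with dedup=False) in this generator before returning it. Whether that alone is enough to get all 9 tasks shown is exactly what we would like to confirm.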

Just to confirm, does this happen when the view_id is "classification" or also when it is "choice"?

Could you share the full call to the Prodigy terminal command so that we may reproduce what is happening on your machine?