Combination of set_hashes(...) and exclude_by not working as expected

Hello,

I am using a custom recipe with custom UI (blocks). I load examples from a custom source (not using any of the standard loaders). My loading function is a normal Python generator.

Prodigy version:

friso@mine-too searchlib % pipenv run pip list | grep prodigy
prodigy               1.11.7

What I am trying to achieve is for Prodigy to de-duplicate the examples based on their text attribute, so that no duplicates are ever shown to the labeller.

The outline of the loading function is this:

def _examples(file_path):
    for eg in load_from_source(file_path):
        eg = prodigy.set_hashes(
            eg,
            input_keys=("text",),
            ignore=(
                'score',
                'rank',
                'model',
                'source',
                '_view_id',
                '_session_id',
                'context_before',
                'context_after',
                'text_segments',
                'segment',
                'request_url',
                'flags',
                'link',
                'heading',
                'bold',
                'form',
            ),
        )
        
        print(eg) # For debugging, to confirm that the hash is set.

        yield eg

load_from_source is a nested loop that loads files from disk, parses the HTML, and cuts each document into snippets based on a heuristic (a sketch follows below). The set of documents is fairly large (a crawl across many sites). Some of the additional attributes in the ignore list are required to properly render the example in the UI and give the labeller relevant context.
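
For illustration, the shape of load_from_source is roughly this (a simplified sketch; the file handling, parser choice, and snippet heuristic here are stand-ins for the real implementation):

from pathlib import Path

from bs4 import BeautifulSoup

def load_from_source(file_path):
    # Walk the crawled pages on disk.
    for html_file in Path(file_path).glob("**/*.html"):
        soup = BeautifulSoup(html_file.read_text(), "html.parser")
        # Cut the document into snippets based on a heuristic,
        # e.g. one candidate example per block-level element.
        for node in soup.find_all(["p", "li", "h1", "h2", "h3"]):
            text = node.get_text(" ", strip=True)
            if text:
                # The real version also attaches the extra attributes
                # (source, context_before, context_after, ...) that the
                # UI template needs.
                yield {"text": text, "source": str(html_file)}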

The recipe function is this:

@prodigy.recipe(
    "boilerplate",
    dataset=("Dataset to save to", "positional", None, str),
    file_path=("Path to crawled pages.", "positional", None, str)
)
def boilerplate(dataset, file_path):
    return {
        'dataset': dataset,
        'stream': _examples(file_path),
        'view_id': 'blocks',
        'config': {
            'blocks': [
                {'view_id': 'choice', 'text': None},
                {
                    'view_id': 'html',
                    'html_template': """
                    <!-- omitted for brevity -->
                    """
                }
            ],
            'choice_auto_accept': True,
            'exclude_by': 'input'
        }
    }

I expected that the combination of set_hashes(...) on each example and "exclude_by": "input" would prevent duplicates from being shown to the labeller. However, duplicates do appear in the UI.

Looking at the results after going through about 60 examples in the UI, I see this (the sort | uniq -c combo counts how often each text + hash pair occurs):

friso@mine-too searchlib % pipenv run prodigy db-out test_bp | jq '[.text, ._input_hash] | join(" ==> ")' | sort | uniq -c | sort -nr | head -n5 
   2 "日本 (日本語) ==> 836771047"
   2 "Read More → ==> 983003608"
   2 "Privacy · Cookies · Disclaimer · ©2021 Adyen ==> 667585559"
   2 "Enterprise Sales Manager ==> -1244278858"
   1 "日本 (日本語) ==> -344476845"

On top of that, identical text strings sometimes hash to different values:

friso@mine-too searchlib % pipenv run prodigy db-out test_bp | jq .text | sort | uniq -c | sort -nr | head -n10
   3 "日本 (日本語)"
   3 "Privacy · Cookies · Disclaimer · ©2021 Adyen"
   3 "Mexico (Español)"
   3 "Global (English)"
   3 "Brasil (Português)"
   2 "中国(简体中文)"
   2 "Česká republika (Čeština)"
   2 "Who you are"
   2 "Sverige (Svenska)"
   2 "Read More →"

friso@mine-too searchlib % pipenv run prodigy db-out test_bp | jq '[.text, ._input_hash] | join(" ==> ")' | grep 'Mexico (Español)'             
"Mexico (Español) ==> 645926660"
"Mexico (Español) ==> -215213997"
"Mexico (Español) ==> 1179272272"

It appears that set_hashes(...) sometimes assigns a different hash to examples with the same text (the example dicts may of course differ in other fields).

To test further, I also ran a version that bypasses set_hashes(...) and sets the _input_hash attribute directly with eg['_input_hash'] = hash(eg['text']). This prevents identical text from yielding different hash values, but duplicates still show up in the UI:

friso@mine-too searchlib % pipenv run prodigy db-out test_bp | jq '[.text, ._input_hash] | join(" ==> ")' | sort | uniq -c | sort -nr | head -n10
   3 "No ==> 2325354343457029000"
   2 "Yes ==> 3868478061270322700"
   2 "Please select ==> -551837291774694700"
   2 "English ==> -6262055385746762000"
   2 "Email * ==> -6090938445933696000"
   2 "Blog ==> 7014436551561426000"
   1 "🥜 In a nutshell ==> -6426966859323529000"
   1 "📖 Picnic Perks ==> 968036584514992100"
   1 "⭐ About you ==> -8235567690865371000"
   1 "français ==> -2439657082349500400"

Finally, I created a contrived version of my setup that does not load real documents from disk, but instead generates about 100 examples by permuting a small set of strings. In that version, de-duplication works as expected.

Is there anything wrong with the way I've set this up? Any hints are much appreciated. For now, I am keeping an in-memory set of seen hashes in the generator (sketch below), but obviously that does not survive server restarts.
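
The in-memory workaround is essentially this:

def _deduped_examples(file_path):
    seen = set()  # input hashes served so far; lost on restart
    for eg in _examples(file_path):
        if eg['_input_hash'] not in seen:
            seen.add(eg['_input_hash'])
            yield eg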

Thanks for the very detailed analysis 🙏 I've actually been wondering if there was a subtle bug in the exclude_by mechanism, but we weren't able to properly track it down. We're just about to release an alpha version with an updated feed implementation that I'm hoping will also solve this (assuming it's a bug). I'll keep you updated once that version is available for testing! In the meantime, this issue will provide a good test case.

As a workaround in the meantime, you could call db.get_input_hashes on startup to fetch all input hashes already in the dataset and seed your in-memory set with them.
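
A rough sketch of what that could look like, building on your generator above (untested, adjust names to your setup):

from prodigy.components.db import connect

def _deduped_examples(dataset, file_path):
    db = connect()
    # Seed the seen-set with the input hashes already saved to the
    # dataset, so duplicates are still skipped after a server restart.
    seen = set(db.get_input_hashes(dataset)) if dataset in db else set()
    for eg in _examples(file_path):
        if eg['_input_hash'] not in seen:
            seen.add(eg['_input_hash'])
            yield eg

In the recipe, you'd then use 'stream': _deduped_examples(dataset, file_path).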

Thank you. I will try that.