1.10.4 prodigy.json exclude_by bug?

Hi,
I reported this problem for 1.10.3 and thought it was fixed, but I saw it again in 1.10.4 while using the textcat.teach recipe. I observed the following lines in the log:

  1. My configuration 'exclude_by: input' was read
    15:07:55: CONTROLLER: Initialising from recipe
    {'batch_size': 100, 'dataset': 'temp', 'db': None, 'exclude': 'input', 'filters': [{'name': 'RelatedSessionsFilter', 'cache_size': 10}], 'max_sessions': 10, 'overlap': True, 'self': <prodigy.components.feeds.SessionFeed object at 0x7f9f24f8e350>, 'stream': <prodigy.components.sorters.ExpMovingAverage object at 0x7f9f24f88410>, 'validator': <prodigy.components.validate.Validator object at 0x7f9f24f9d190>, 'view_id': 'classification'}

  2. However, later I saw that both 'by_input' and 'by_task' were enabled in the filter. Could that be wrong?
    15:07:55: FILTER: Filtering duplicates from stream
    {'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f9f25384910>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f9fcc53cdd0>>, 'warn_threshold': 0.4}

  3. In the tasks, I saw 2 instances were selected although they had the same input_hash
    {'answer': 'accept', 'text': '... for you only USD...', 'meta': {'event_id': 'masked', 'score': 0.14207322895526886}, '_input_hash': -301208957, '_task_hash': -519955208, 'label': 'insider', 'score': 0.14207322895526886, 'priority': 0.14207322895526886, 'spans': [], '_session_id': None, '_view_id': 'classification'},

{'answer': 'accept', 'text': '... for you only USD...', 'meta': {'event_id': 'masked', 'pattern': '389'}, '_input_hash': -301208957, '_task_hash': -2036512131, 'spans': [{'text': 'for you only', 'start': 18, 'end': 30, 'pattern': -149218508}], 'label': 'insider', '_session_id': None, '_view_id': 'classification'},


Hi! Are you customising anything in a custom recipe or your prodigy.json?

I'm confused about where the 'exclude': 'input' value comes from. This is the "exclude" value returned by the custom recipe, and it should be a list of dataset names or None. The exclude_by setting is specified in the "config" returned by the recipe.

Edit: Ah, looked at the wrong log statement. This makes sense and indicates it's correctly passed on.

These are the settings for the filter_duplicates filter that runs automatically on the raw source data when it's loaded. It's not related to exclusion of existing annotations – the mechanisms use similar names because it's a similar principle (which hashes should be used when filtering duplicates).

No. I don't have any customization except the host configuration in prodigy.json. Here is my prodigy.json

{
  "exclude_by": "input",
  "custom_theme": {"cardMaxWidth": 1500},
  "global_css": ".prodigy-content {text-align: left; font-size: 12pt}",
  "host": "0.0.0.0",
  "batch_size": 10,
  "feed_overlap": true
}

Thanks, this is helpful! I also realised I looked at the wrong log statement so the exclude: "input" above does make sense and indicates that it's set correctly. I'll look into this!

Is "feed_overlap": true important for your use case, i.e. are you using named sessions? If not, can you try setting it to false?

Yes, I'm using named sessions.

I've found that if I add the following filter to textcat.teach, the same sentence is no longer shown twice, once for the pattern match and once for the model suggestion.

from prodigy.components.filters import filter_duplicates

# Filter out sentences that are picked by both the matcher and the model
stream = filter_duplicates(stream, by_input=True, by_task=False)

Thanks for the update – I think I understand what's happening now: the exclude_by setting refers to the exclusion strategy for examples that are already in the dataset, not examples within the same stream. So if you set "exclude_by" to "input", you won't see a suggestion again if the text is already in the dataset (even if it's with a different label).

If you filter duplicates out of your stream, the filtering will be applied at runtime to whatever is coming in.
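To make the distinction concrete, here is a minimal pure-Python sketch of the exclusion idea. It is not Prodigy's actual implementation: the function name `exclude_existing` and its arguments are hypothetical, and only the two hash values are taken from the tasks in the log above. It illustrates why both tasks can still appear when the dataset does not yet contain the text, even with exclude_by set to "input".

```python
def exclude_existing(stream, dataset_examples, exclude_by="task"):
    """Drop incoming tasks whose hash already appears in the dataset.

    Hypothetical helper for illustration only, not a Prodigy API.
    """
    key = "_input_hash" if exclude_by == "input" else "_task_hash"
    seen = {eg[key] for eg in dataset_examples}
    for eg in stream:
        # Only compares against what is already saved, never against
        # other tasks in the same incoming stream.
        if eg[key] not in seen:
            yield eg


# Hashes copied from the two tasks in the log (model vs. pattern suggestion)
model_task = {"_input_hash": -301208957, "_task_hash": -519955208}
pattern_task = {"_input_hash": -301208957, "_task_hash": -2036512131}

# Empty dataset: nothing to exclude, so both tasks pass through.
fresh = list(exclude_existing([model_task, pattern_task], [], exclude_by="input"))

# Once the model task is saved, the pattern task with the same input is excluded.
after_save = list(exclude_existing([pattern_task], [model_task], exclude_by="input"))

print(len(fresh), len(after_save))
```

Under these assumptions, the duplicate only disappears once one of the two annotations has reached the dataset, which matches the behaviour described in this thread.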

Are you sure you want to filter by input, though? This would mean that if you have multiple categories, you'd only ever see a text once and would never get to see suggestions for a different label again. Instead, maybe you could make sure that all examples have hashes and then filter by task hash? This would mean that you don't see the same suggestion for a label twice, but still get different label suggestions.
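As a rough illustration of that trade-off, the sketch below deduplicates a stream either by input hash or by task hash. It is a simplified stand-in written in plain Python, not the real filter_duplicates; the function name `dedupe_stream`, the example hashes, and the label "other" are all made up for the example.

```python
def dedupe_stream(stream, by_input=False, by_task=True):
    """Yield tasks, skipping any whose selected hash was already seen.

    Simplified illustration only; hypothetical helper, not a Prodigy API.
    """
    seen_input, seen_task = set(), set()
    for eg in stream:
        if by_input and eg["_input_hash"] in seen_input:
            continue
        if by_task and eg["_task_hash"] in seen_task:
            continue
        seen_input.add(eg["_input_hash"])
        seen_task.add(eg["_task_hash"])
        yield eg


# Hypothetical stream: one text (input hash 1) suggested three times.
examples = [
    {"_input_hash": 1, "_task_hash": 10, "label": "insider"},
    {"_input_hash": 1, "_task_hash": 10, "label": "insider"},  # exact repeat
    {"_input_hash": 1, "_task_hash": 11, "label": "other"},    # same text, new label
]

kept_by_input = list(dedupe_stream(examples, by_input=True, by_task=False))
kept_by_task = list(dedupe_stream(examples, by_input=False, by_task=True))

print([eg["label"] for eg in kept_by_input])
print([eg["label"] for eg in kept_by_task])
```

Filtering by input keeps only the first suggestion for the text, while filtering by task drops the exact repeat but still lets the second label through.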