1.10.4 prodigy.json exclude_by bug?

curious · October 30, 2020, 3:36pm

Hi,
I reported this problem for 1.10.3 and thought it was fixed, but I'm seeing it again in 1.10.4. I used the textcat.teach recipe and observed the following lines in the log.

My configuration 'exclude_by: input' was read:
15:07:55: CONTROLLER: Initialising from recipe
{'batch_size': 100, 'dataset': 'temp', 'db': None, 'exclude': 'input', 'filters': [{'name': 'RelatedSessionsFilter', 'cache_size': 10}], 'max_sessions': 10, 'overlap': True, 'self': <prodigy.components.feeds.SessionFeed object at 0x7f9f24f8e350>, 'stream': <prodigy.components.sorters.ExpMovingAverage object at 0x7f9f24f88410>, 'validator': <prodigy.components.validate.Validator object at 0x7f9f24f9d190>, 'view_id': 'classification'}
Later, however, I saw that both 'by_input' and 'by_task' were used in the filter. Could that be wrong?
15:07:55: FILTER: Filtering duplicates from stream
{'by_input': True, 'by_task': True, 'stream': <generator object at 0x7f9f25384910>, 'warn_fn': <bound method Printer.warn of <wasabi.printer.Printer object at 0x7f9fcc53cdd0>>, 'warn_threshold': 0.4}
In the tasks, I saw that 2 instances were selected although they had the same _input_hash:
{'answer': 'accept', 'text': ... for you only USD...', 'meta': {'event_id': 'masked', 'score': 0.14207322895526886}, '_input_hash': -301208957, '_task_hash': -519955208, 'label': 'insider', 'score': 0.14207322895526886, 'priority': 0.14207322895526886, 'spans': [], '_session_id': None, '_view_id': 'classification'},

{'answer': 'accept', 'text': '... for you only USD...', 'meta': {'event_id': 'masked', 'pattern': '389'}, '_input_hash': -301208957, '_task_hash': -2036512131, 'spans': [{'text': 'for you only', 'start': 18, 'end': 30, 'pattern': -149218508}], 'label': 'insider', '_session_id': None, '_view_id': 'classification'},

ines · November 2, 2020, 9:11am

Hi! Are you customising anything in a custom recipe or your prodigy.json?

I'm confused about where the 'exclude': 'input' in your log comes from. This is the "exclude" value returned by the custom recipe, and it should be a list of dataset names or None. The exclude_by setting is specified in the "config" returned by the recipe.

Edit: Ah, looked at the wrong log statement. This makes sense and indicates it's correctly passed on.

These are the settings for the filter_duplicates filter that runs automatically on the raw source data when it's loaded. It's not related to exclusion of existing annotations – the mechanisms use similar names because it's a similar principle (which hashes should be used when filtering duplicates).

curious · November 3, 2020, 1:24am

No, I don't have any customization except the host configuration in prodigy.json. Here is my prodigy.json:

{
  "exclude_by": "input",
  "custom_theme": {"cardMaxWidth": 1500},
  "global_css": ".prodigy-content {text-align: left; font-size: 12pt}",
  "host": "0.0.0.0",
  "batch_size": 10,
  "feed_overlap": true
}

ines · November 3, 2020, 9:31am

Thanks, this is helpful! I also realised I looked at the wrong log statement so the exclude: "input" above does make sense and indicates that it's set correctly. I'll look into this!

Is "feed_overlap": true important for your use case, i.e. are you using named sessions? If not, can you try setting it to false?

curious · November 3, 2020, 2:41pm

Yes, I'm using named sessions.

I've found that if I add the following filter to textcat.teach, I can prevent the same sentence from being shown twice, once for the pattern match and once for the model suggestion.

from prodigy.components.filters import filter_duplicates

# Filter out sentences that are picked by both the matcher and the model
stream = filter_duplicates(stream, by_input=True, by_task=False)

ines · November 10, 2020, 3:23am

Thanks for the update – I think I understand what's happening now: the exclude_by setting refers to the exclusion strategy for examples that are already in the dataset, not examples within the same stream. So if you set "exclude_by" to "input", you won't see a suggestion again if the text is already in the dataset (even if it's with a different label).

If you filter duplicates out of your stream, the filtering will be applied at runtime to whatever is coming in.
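
To make the difference between the two mechanisms concrete, here is a minimal pure-Python sketch. It is not Prodigy's implementation: `exclude_seen` and `dedupe_stream` are hypothetical stand-ins for the exclude_by logic and for a stream-level duplicate filter, and the only data used are the two hash pairs from the tasks posted above.

```python
from typing import Dict, Iterable, Iterator, List


def exclude_seen(
    stream: Iterable[Dict], dataset: List[Dict], exclude_by: str = "task"
) -> Iterator[Dict]:
    """Hypothetical stand-in for exclude_by: drop incoming examples whose
    hash already exists in the dataset. It never compares stream examples
    with each other."""
    key = "_input_hash" if exclude_by == "input" else "_task_hash"
    seen = {eg[key] for eg in dataset}
    for eg in stream:
        if eg[key] not in seen:
            yield eg


def dedupe_stream(
    stream: Iterable[Dict], by_input: bool = False, by_task: bool = True
) -> Iterator[Dict]:
    """Hypothetical stand-in for a stream filter: drop examples whose hash
    was already seen earlier in the same stream."""
    seen_input, seen_task = set(), set()
    for eg in stream:
        if by_input and eg["_input_hash"] in seen_input:
            continue
        if by_task and eg["_task_hash"] in seen_task:
            continue
        seen_input.add(eg["_input_hash"])
        seen_task.add(eg["_task_hash"])
        yield eg


# The two suggestions from the thread: same text (same input hash),
# one scored by the model and one produced by a pattern match.
stream = [
    {"_input_hash": -301208957, "_task_hash": -519955208, "label": "insider"},
    {"_input_hash": -301208957, "_task_hash": -2036512131, "label": "insider"},
]

# Nothing annotated yet, so excluding against the dataset removes nothing.
after_exclude = list(exclude_seen(stream, dataset=[], exclude_by="input"))

# Filtering the stream itself by input hash keeps only the first suggestion.
after_dedupe = list(dedupe_stream(stream, by_input=True, by_task=False))

print(len(after_exclude), len(after_dedupe))
```

Under these assumptions, excluding against an empty dataset lets both suggestions through, while the stream-level filter on the input hash keeps only the first one, which matches what was reported above.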

Are you sure you want to filter by input, though? This would mean that if you have multiple categories, you'd only ever see a text once and would never get to see suggestions for a different label again. Instead, maybe you could make sure that all examples have hashes and then filter by task hash? This would mean that you don't see the same suggestion for a label twice, but still get different label suggestions.
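
As a rough illustration of that suggestion, the sketch below assigns hashes to examples that lack them and then filters on the task hash only. Prodigy ships its own set_hashes helper for this; `ensure_hashes` here is a hypothetical stand-in that hashes just the text and the label, which is a simplifying assumption (a real task hash may cover more fields, such as spans). The example text and the "other" label are made up for illustration.

```python
import hashlib
from typing import Dict, Iterable, Iterator


def _stable_hash(value: str) -> int:
    # Deterministic integer hash for illustration only.
    return int(hashlib.md5(value.encode("utf8")).hexdigest()[:8], 16)


def ensure_hashes(eg: Dict) -> Dict:
    """Hypothetical stand-in: add _input_hash (text only) and _task_hash
    (text plus label) if the example does not already carry them."""
    eg = dict(eg)
    eg.setdefault("_input_hash", _stable_hash(eg["text"]))
    eg.setdefault("_task_hash", _stable_hash(eg["text"] + "|" + eg.get("label", "")))
    return eg


def filter_by_task_hash(stream: Iterable[Dict]) -> Iterator[Dict]:
    """Drop an example only if the same text/label combination was seen."""
    seen = set()
    for eg in map(ensure_hashes, stream):
        if eg["_task_hash"] in seen:
            continue
        seen.add(eg["_task_hash"])
        yield eg


suggestions = [
    {"text": "example text", "label": "insider"},
    {"text": "example text", "label": "insider"},  # same suggestion again
    {"text": "example text", "label": "other"},    # same text, different label
]

kept = list(filter_by_task_hash(suggestions))
kept_labels = [eg["label"] for eg in kept]
print(kept_labels)
```

With this filter the repeated "insider" suggestion is dropped, but the same text still comes through once for the second label. Filtering on the input hash instead would have removed it as well.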
