prodigy.json "exclude_by": "input" doesn't seem to be working

I'm using Prodigy 1.9.9. I don't want the same text to be selected by both the pattern matcher and the model in textcat.teach, so I modified prodigy.json and added "exclude_by": "input". That didn't work: I still see the same text shown twice, once with the pattern number and once with the model score.

I exported the data and could see that the input hash was the same for these two records. What am I missing? One more detail: to make this easy to test, I only annotated 5 records, and batch_size was set to 10. I wonder whether the small size of the dataset caused the problem.

Thank you.

Hi @curious,

Have you tried updating to the latest build (currently 1.10.2)? There are fixes specific to input filtering in it that you might benefit from.

If upgrading is impossible, or doesn't help, I'd be happy to try to reproduce your problem. In that case, could you provide the command-line arguments you're using and your prodigy.json configuration (minus any sensitive fields like your database information)? If you could share a few of your examples (again with sensitive info redacted), that would be even better.

Thanks,
-Justin

Installed 1.10.2 and still have the same problem. Here is my command line:
prodigy textcat.teach temp_db en_core_web_lg chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl

Here is my prodigy.json
{
  "custom_theme": {
    "cardMaxWidth": 1500
  },
  "global_css": ".prodigy-content {text-align: left; font-size: 12pt}",
  "host": "0.0.0.0",
  "show_flag": false,
  "batch_size": 10,
  "exclude_by": "input"
}

There is nothing special about my text at all. You should be able to reproduce it.

Sorry, I only noticed this now: setting "exclude_by": "input" in textcat.teach is usually not something we'd recommend, because it can make the active learning very ineffective if you're annotating with more than one label. The exclude_by setting is also mostly intended to exclude annotations present in an existing dataset, not to exclude examples within the same stream.
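To illustrate the intended use (just a sketch; previously_annotated_db is a placeholder dataset name, and you can confirm the flags with prodigy textcat.teach --help): exclude_by controls how incoming tasks are compared against annotations that already exist in your datasets, e.g. ones passed via --exclude:

prodigy textcat.teach temp_db en_core_web_lg chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl --exclude previously_annotated_db

With "exclude_by": "input", a task whose input hash matches an example already saved in those datasets would be skipped.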

You could add your own logic that hashes all examples and filters out duplicates before they're sent out. You could also change task_hash_keys=("label",) on the pattern matcher to an empty tuple, so it doesn't use the label when creating the hashes used to detect duplicates.
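For the first option, something along these lines should work inside a custom recipe (a rough sketch using the set_hashes helper; adapt it to however you build your stream):

from prodigy import set_hashes

def filter_seen_inputs(stream):
    # Drop any incoming task whose input hash has already been sent out in
    # this session, regardless of whether it came from the pattern matcher
    # or from the model.
    seen = set()
    for eg in stream:
        eg = set_hashes(eg)  # make sure "_input_hash" is set on the task
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])
            yield eg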

In general, though, if you have a lot of overlaps between the pattern matches and the model suggestions, it indicates that the patterns aren't really making a big difference.

Thank you for the clarification.

In that case, I'd use the match recipe first, build a model from the match output, and then use textcat.teach to label more cases.
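Roughly like this, I think (the recipe names and arguments below are my reading of the 1.10 docs, so I'd double-check them with prodigy --help before running anything):

# 1) Annotate pattern matches only
prodigy match matched_db en_core_web_lg chat_sentence.jsonl --patterns my_pattern.jsonl
# 2) Train a text classifier from those annotations
prodigy train textcat matched_db en_core_web_lg --output ./textcat_model
# 3) Use that model as the starting point for active learning
prodigy textcat.teach temp_db ./textcat_model chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl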

@curious I found a bug where "exclude_by": "input" would not work in 1.10.2, which could be related to the trouble you're having. If you'd be interested in trying out a beta version to see if it resolves your problem, send me an email at justin@explosion.ai

Thank you. Your fix worked!


Thanks for the update. The fixes are released as part of 1.10.3 now!