prodigy.json "exclude_by": "input" doesn't seem to be working

I'm using Prodigy 1.9.9. I don't want the same text to be selected by both the pattern matcher and the model in textcat.teach, so I modified prodigy.json and added "exclude_by": "input". That didn't work: I still see the same text shown twice, once with the pattern number and once with the model score.

I exported the data and could see that the input hash was the same for these two records. What am I missing? One more detail: to make this easy to test, I only annotated 5 records, and batch_size was set to 10. I wonder whether the small size of the dataset caused the problem.

Thank you.

Hi @curious,

Have you tried updating to the latest build (currently 1.10.2)? There are fixes specific to input filtering in it that you might benefit from.

If upgrading is impossible, or doesn't help, I'd be happy to try to reproduce your problem. In that case, could you provide the command-line arguments you're using and your prodigy.json configuration (minus any sensitive fields like your database information)? If you could share a few of your examples (again with sensitive info redacted), that would be even better.

Thanks,
-Justin

Installed 1.10.2 and still have the same problem. Here is my command line:
prodigy textcat.teach temp_db en_core_web_lg chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl

Here is my prodigy.json
{
  "custom_theme": {
    "cardMaxWidth": 1500
  },
  "global_css": ".prodigy-content {text-align: left; font-size: 12pt}",
  "host": "0.0.0.0",
  "show_flag": false,
  "batch_size": 10,
  "exclude_by": "input"
}

There is nothing special about my text at all. You should be able to reproduce it.

Sorry, I only noticed this now: setting "exclude_by": "input" in textcat.teach is usually not something we'd recommend, because it can make the active learning very ineffective if you're annotating with more than one label. The exclude_by setting is also mostly intended to exclude annotations present in an existing dataset, not to exclude examples within the same stream.
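To illustrate the intended use (just a sketch; previously_annotated_db is a placeholder dataset name, and you can confirm the flags with prodigy textcat.teach --help): exclude_by controls how incoming tasks are compared against annotations that already exist in your datasets, e.g. ones passed via --exclude:

prodigy textcat.teach temp_db en_core_web_lg chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl --exclude previously_annotated_db

With "exclude_by": "input", a task whose input hash matches an example already saved in those datasets would be skipped.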

You could add your own logic that hashes all examples and filters out duplicates before they're sent out. You could also change task_hash_keys=("label",) on the pattern matcher to an empty tuple, so it doesn't use the label when creating the hashes used to detect duplicates.
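For the first option, something along these lines should work inside a custom recipe (a rough sketch using the set_hashes helper; adapt it to however you build your stream):

from prodigy import set_hashes

def filter_seen_inputs(stream):
    # Drop any incoming task whose input hash has already been sent out in
    # this session, regardless of whether it came from the pattern matcher
    # or from the model.
    seen = set()
    for eg in stream:
        eg = set_hashes(eg)  # make sure "_input_hash" is set on the task
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])
            yield eg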

In general, though, if you have a lot of overlaps between the pattern matches and the model suggestions, it indicates that the patterns aren't really making a big difference.

Thank you for the clarification.

In that case, I'd use the match recipe first, build a model from the match output, and then use textcat.teach to label more cases.
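Roughly like this, I think (the recipe names and arguments below are my reading of the 1.10 docs, so I'd double-check them with prodigy --help before running anything):

# 1) Annotate pattern matches only
prodigy match matched_db en_core_web_lg chat_sentence.jsonl --patterns my_pattern.jsonl
# 2) Train a text classifier from those annotations
prodigy train textcat matched_db en_core_web_lg --output ./textcat_model
# 3) Use that model as the starting point for active learning
prodigy textcat.teach temp_db ./textcat_model chat_sentence.jsonl --label mylabel -pt my_pattern.jsonl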

@curious I found a bug where "exclude_by": "input" would not work in 1.10.2, which could be related to the trouble you're having. If you'd be interested in trying out a beta version to see if it resolves your problem, send me an email at justin@explosion.ai

Thank you. Your fix worked!


Thanks for the update. The fixes are released as part of 1.10.3 now!