Textcat with custom sorter didn't exclude dataset

I'm using Prodigy 1.9.8. I've updated the textcat recipe so that it only returns examples with a score higher than 0.7. I ran textcat.teach with it and excluded some dataset IDs, but examples that should have been excluded were still displayed during the teach session. Is this a bug in Prodigy, or is something missing in my custom sorter?

I used the following command to teach -

prodigy textcat.teach --label lab db_model_2 db2 /tmp/trained_model/ ~/new_dataset.jsonl -e db_pattern,db1

The db_pattern dataset was created by the new match recipe. I compared the output of this teach session, db_model2, with db_pattern. The same text has the same input_hash, but the task_hash is different. I wonder whether that's the problem. Here is the custom_sorter function in the textcat recipe -

def custom_sorter(scored_examples):
    for score, example in scored_examples:
        # your own logic here to decide whether to send out an
        # example for annotation
        if score > 0.70:
            yield example

Yes, the _task_hash value is what determines the exclusion by default: two tasks are considered to be different questions if they have different task hashes. (If their input hashes are the same, they're considered different questions about the same input.) You can read more about this here.
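To make the distinction concrete, here's a minimal sketch of how two examples can share an input hash but differ in task hash. The toy_hashes helper, the MD5-over-JSON hashing and the sample texts are all made up for illustration and are not Prodigy's actual implementation; the point is only that the task hash depends on more keys than the input hash.

```python
import hashlib
import json


def _hash(values):
    # Toy stand-in for a hash function: NOT what Prodigy uses internally.
    payload = json.dumps(values, sort_keys=True, default=str)
    return hashlib.md5(payload.encode("utf8")).hexdigest()


def toy_hashes(task, input_keys=("text",), task_keys=("label", "spans")):
    # The input hash only looks at the raw input (here: the text).
    input_hash = _hash([task.get(key) for key in input_keys])
    # The task hash builds on the input hash plus the "question" keys.
    task_hash = _hash([input_hash] + [task.get(key) for key in task_keys])
    return input_hash, task_hash


# Hypothetical examples: same text, but one carries pattern matches as spans.
from_patterns = {
    "text": "Reset my password",
    "label": "lab",
    "spans": [{"start": 0, "end": 5}],
}
from_model = {"text": "Reset my password", "label": "lab"}

patterns_input, patterns_task = toy_hashes(from_patterns)
model_input, model_task = toy_hashes(from_model)

print(patterns_input == model_input)  # same input
print(patterns_task == model_task)    # different question about that input
```

If the task hash were computed from the label only, the two examples in this toy setup would end up with identical task hashes as well.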

So I think the hashes in your dataset were created by taking the text and the matches into account (which makes sense in the context of the match recipe). So there are two things you can do in your custom recipe:

  • Rehash the examples with set_hashes and overwrite=True, and the keys you want to consider (probably text for the input hash and label for the task hash).
  • Set "exclude_by": "input" in the "config" returned by your recipe, to treat examples with the same input hashes as duplicates. (Note that the exclude logic is only applied after the recipe has started, so you'd still see duplicates in your sorter etc. – they just won't be shown in the app when the server starts and puts together the stream.)
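Here's a rough sketch of how both options could look in your sorter. It assumes your examples have "text" and "label" keys, and it takes the hashing function as an argument (hash_fn is a name I made up) so the snippet doesn't depend on anything being installed; in your recipe you'd pass in Prodigy's set_hashes. RECIPE_CONFIG is also just an illustrative name for the "config" dict your recipe returns.

```python
def custom_sorter(scored_examples, hash_fn, threshold=0.70):
    """Yield only high-scoring examples, rehashed on text and label.

    hash_fn is expected to behave like prodigy.set_hashes, i.e. accept
    input_keys, task_keys and overwrite, and return the updated task.
    """
    for score, example in scored_examples:
        if score > threshold:
            # Option 1: overwrite the existing hashes so the task hash
            # only reflects the text and the label.
            yield hash_fn(
                example,
                input_keys=("text",),
                task_keys=("label",),
                overwrite=True,
            )


# Option 2: merge this into the "config" returned by your recipe so that
# examples with the same input hash are treated as duplicates.
RECIPE_CONFIG = {"exclude_by": "input"}

# In the recipe itself (not executed here):
#   from prodigy import set_hashes
#   stream = custom_sorter(scored_stream, set_hashes)
```

Keep in mind that rehashing the stream only makes the task hashes comparable if the examples in the excluded datasets were hashed on the same keys. Since your input hashes already match, the exclude_by setting is probably the simpler of the two.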