hi @joebuckle!
Thanks for your question.
Are you familiar with Prodigy's hashing?
When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are integers, so they can be stored as JSON with each task. Based on those hashes, Prodigy can determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input.
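To make the idea concrete, here's a toy sketch of the two-hash scheme in plain Python. This is *not* Prodigy's actual implementation (Prodigy has its own hashing internals); it just illustrates how an input hash covers only the raw input fields while a task hash also covers the question being asked:

```python
import hashlib
import json

def make_hashes(example, input_keys=("text",), task_keys=("label", "spans")):
    """Toy illustration of input vs. task hashing (not Prodigy's real code).

    The input hash covers only the input fields; the task hash covers
    the input plus the annotation question asked about it.
    """
    input_data = json.dumps({k: example.get(k) for k in input_keys}, sort_keys=True)
    task_data = json.dumps({k: example.get(k) for k in task_keys}, sort_keys=True)
    input_hash = int(hashlib.md5(input_data.encode("utf8")).hexdigest()[:8], 16)
    task_hash = int(hashlib.md5((input_data + task_data).encode("utf8")).hexdigest()[:8], 16)
    return {**example, "_input_hash": input_hash, "_task_hash": task_hash}

a = make_hashes({"text": "hello", "label": "GREETING"})
b = make_hashes({"text": "hello", "label": "FAREWELL"})
# Same input, different question: input hashes match, task hashes differ
assert a["_input_hash"] == b["_input_hash"]
assert a["_task_hash"] != b["_task_hash"]
```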
By default, Prodigy will exclude examples that have the same task hash, which covers the input plus the task itself. If you ran the exact same task (i.e. recipe), in theory Prodigy should skip it. I'm wondering if the task hash is different due to a slight tweak in your custom recipe.
You could try setting `exclude_by` in your configuration to `"input"`. Prodigy will then skip (i.e. deduplicate) examples based on the input hash instead. Since you're using a custom recipe, you can pass this in your recipe's return value under the `"config"` key. That way the setting only applies to your recipe and not to your global config.
```python
return {
    "dataset": dataset,  # the dataset to save annotations to
    "stream": stream,    # the stream of incoming examples
    "config": {
        "exclude_by": "input",  # default value is "task"
    },
}
```
Alternatively, since you have a custom recipe, you can also use the hashes directly to implement whatever logic you want. This may be the more flexible way to make sure the deduplication behaves exactly how you want it to.
If you do this, check out `set_hashes`, which creates hashes for your new batch (stream) of data, as well as `get_input_hashes` and `get_task_hashes`, which retrieve all the hashes that exist for a given dataset.
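The filtering step itself boils down to a small generator. Here's a minimal sketch in plain Python, assuming `stream` is an iterable of already-hashed examples (each carrying an `"_input_hash"` key, as `set_hashes` would add) and `existing_input_hashes` is the set of hashes you pulled from your dataset:

```python
def filter_seen_inputs(stream, existing_input_hashes):
    """Yield only examples whose input hasn't been annotated yet.

    stream: iterable of dicts, each with an "_input_hash" key
    existing_input_hashes: hashes already present in the dataset
    """
    seen = set(existing_input_hashes)
    for eg in stream:
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])  # also dedupe within the stream itself
            yield eg

stream = [
    {"text": "a", "_input_hash": 1},
    {"text": "b", "_input_hash": 2},  # already annotated, will be skipped
    {"text": "a", "_input_hash": 1},  # duplicate within the stream, skipped
]
filtered = list(filter_seen_inputs(stream, {2}))
assert [eg["text"] for eg in filtered] == ["a"]
```

You'd plug something like this into your recipe between loading the stream and returning it, with the existing hashes fetched from the database for your target dataset.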
Let me know if this helps or if you have any further questions!