Thanks for the question and sorry about the late reply.
We’ve actually been going back and forth on whether to make this a feature or even the default behaviour (the latter is probably not a good idea, as it makes re-annotating harder and can lead to longer loading times for large datasets). The setting also needs to be recipe-specific: within the same workflow, you might want to use it for some recipes and turn it off for others.
The good news is that all the underlying logic is already there. As the examples come in, Prodigy assigns two hashes to each of them: an input hash, based on the task text or image, and a task hash, based on the input and the additional features, like the spans or label.
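To make the difference between the two hashes more concrete, here's a rough sketch in plain Python. This is not Prodigy's actual hashing code: the helper names `input_hash` and `task_hash`, the key lists and the use of MD5 are all assumptions made up for illustration. It only shows the idea that two tasks with the same text but different labels share an input hash and get different task hashes.

```python
import hashlib
import json

# Hypothetical key lists for illustration only, not Prodigy's real configuration.
INPUT_KEYS = ("text", "image")
TASK_KEYS = ("spans", "label")


def _digest(payload):
    # Serialize with sorted keys so equal content always gives the same hash.
    data = json.dumps(payload, sort_keys=True).encode("utf8")
    return hashlib.md5(data).hexdigest()


def input_hash(task):
    # Depends only on the raw input (text or image).
    return _digest({key: task[key] for key in INPUT_KEYS if key in task})


def task_hash(task):
    # Depends on the input hash plus the additional features (spans, label).
    features = {key: task[key] for key in TASK_KEYS if key in task}
    return _digest({"input": input_hash(task), "features": features})


# Same text, different label: same input, different task.
task_a = {"text": "Apple is a company", "label": "ORG"}
task_b = {"text": "Apple is a company", "label": "PRODUCT"}
same_input = input_hash(task_a) == input_hash(task_b)
same_task = task_hash(task_a) == task_hash(task_b)
```

Filtering on the task hash rather than the input hash means you can still be asked about the same text again with a different label or span.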
So I’d suggest we add a filter function that filters out tasks whose task hash already exists in the current dataset, and that can be toggled on a per-recipe basis. In the meantime, you can also just implement this functionality via a custom recipe by adding an on_load callback that gets the task hashes from the database and modifies the stream so that tasks with the same task hash are filtered out:
from prodigy.recipes.ner import teach  # import the ner.teach recipe
from prodigy.util import TASK_HASH_ATTR  # the task hash attr constant ('_task_hash')
# you could also hard-code this, but using the constant is cleaner

def custom_recipe(dataset, model, source, loader):  # etc.
    # recipes are Python functions that return a dictionary of components,
    # so you can just call them and receive back a dict to return or overwrite
    components = teach(dataset, model, source=source, loader=loader)
    stream = components['stream']  # keep a reference to the original stream

    def on_load(ctrl):
        # this function is called on load and gives you access to the controller,
        # which includes the database
        task_hashes = ctrl.db.get_task_hashes(dataset)  # get task hashes
        # overwrite the stream and filter out examples with task hashes that already exist
        components['stream'] = (eg for eg in stream if eg[TASK_HASH_ATTR] not in task_hashes)

    components['on_load'] = on_load  # set an on_load option in the components
    return components  # return the dict of components
Of course, the above would also work for a custom recipe. For more details on how to wrap built-in recipes, see this comment.
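If you want to try out the filtering step on its own, without Prodigy or a database, here's a minimal sketch. The function name `filter_seen_tasks` and the example tasks are made up for illustration, and the hash values are plain integers standing in for real task hashes. `TASK_HASH_ATTR` is hard-coded to `'_task_hash'`, the value mentioned in the recipe comments.

```python
TASK_HASH_ATTR = "_task_hash"  # hard-coded here; the recipe imports it from prodigy.util


def filter_seen_tasks(stream, seen_hashes):
    """Yield only the tasks whose task hash is not in seen_hashes.

    Hypothetical helper for illustration, not part of Prodigy.
    """
    seen = set(seen_hashes)  # a set makes each membership check cheap
    for eg in stream:
        if eg[TASK_HASH_ATTR] not in seen:
            yield eg


# Stand-in data: three incoming tasks, one of which is already in the dataset.
incoming = [
    {"text": "first", TASK_HASH_ATTR: 1},
    {"text": "second", TASK_HASH_ATTR: 2},
    {"text": "third", TASK_HASH_ATTR: 3},
]
already_annotated = [2]

remaining = list(filter_seen_tasks(incoming, already_annotated))
```

Because it's a generator, the stream is still consumed lazily, so this should also work for large sources that you don't want to load into memory all at once.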