Continue to annotate same data in new session

ines · December 4, 2017, 9:26am

Thanks for the question and sorry about the late reply.

We’ve actually been going back and forth on this, in particular on whether to make this a feature or even the default behaviour (which is probably not a good idea, as it makes re-annotating harder and can lead to longer loading times for large datasets). The setting also needs to be recipe-specific: for some recipes within the same workflow, you might want to use it, and for others, you might want to turn it off.

The good news is, all the underlying logic needed is already there: As the examples come in, Prodigy assigns two hashes to them – an input hash based on the task text or image and a task hash, based on the input and the additional features, like the spans or label.
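
To make the difference between the two hashes a bit more concrete, here's a minimal sketch in plain Python. It's not Prodigy's actual hashing implementation: the helper names `input_hash` and `task_hash`, the use of `hashlib` and the choice of keys are all assumptions made purely for illustration.

```python
import hashlib
import json


def _hash(value):
    # Stable hash of any JSON-serialisable value (illustrative only,
    # not how Prodigy computes its hashes internally).
    data = json.dumps(value, sort_keys=True).encode("utf8")
    return hashlib.md5(data).hexdigest()


def input_hash(task):
    # Depends only on the raw input, e.g. the text or image.
    return _hash({k: task[k] for k in ("text", "image") if k in task})


def task_hash(task):
    # Depends on the input *and* the features being annotated.
    features = {k: task[k] for k in ("spans", "label") if k in task}
    return _hash([input_hash(task), features])


a = {"text": "Apple buys startup", "label": "ORG"}
b = {"text": "Apple buys startup", "label": "PRODUCT"}

same_input = input_hash(a) == input_hash(b)  # same text
same_task = task_hash(a) == task_hash(b)     # different label
```

Two tasks with the same text but different labels share an input hash and get different task hashes. That's why filtering on the task hash only skips the exact questions you've already answered, rather than every question about the same text.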

So I’d suggest we add a filter function that filters out tasks with the same task hash that already exist in the current dataset, and that can be toggled on a per-recipe basis. In the meantime, you can also just implement this functionality via a custom recipe by adding an on_load callback that gets the task hashes from the database, and modifies the stream so that tasks with the same task hash are filtered out:

import prodigy
from prodigy.recipes.ner import teach  # import the ner.teach recipe
from prodigy.util import TASK_HASH_ATTR  # the task hash attr constant ('_task_hash')
# you could also hard-code this, but using the constant is cleaner

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset, model, source, loader):  # etc.
    # recipes are Python functions that return a dictionary of components,
    # so you can just call them and receive back a dict to return or overwrite
    components = teach(dataset, model, source=source, loader=loader)

    def on_load(ctrl):
        # this function is called on load and gives you access to the controller,
        # which includes the database
        task_hashes = set(ctrl.db.get_task_hashes(dataset))  # get task hashes
        # overwrite the stream and filter out examples with task hashes that already exist
        # (the stream is read from the components, as there's no local `stream` variable)
        components['stream'] = (eg for eg in components['stream'] if eg[TASK_HASH_ATTR] not in task_hashes)

    components['on_load'] = on_load  # set an on_load option in the components
    return components  # return the dict of components

Of course, the same on_load approach would also work for a fully custom recipe that doesn't wrap a built-in one. For more details on how to wrap built-in recipes, see this comment.
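
If you want to try the filtering logic on its own, without Prodigy or a database, here's a small standalone sketch. The function name `filter_seen` and the example data are hypothetical. Only the `'_task_hash'` key matches what Prodigy actually sets on each task.

```python
def filter_seen(stream, seen_hashes, key="_task_hash"):
    # Lazily yield only the examples whose hash isn't in seen_hashes.
    seen = set(seen_hashes)  # set membership checks are fast
    for eg in stream:
        if eg[key] not in seen:
            yield eg


# Hypothetical data standing in for the incoming stream and the
# task hashes that would normally come from the database.
stream = [
    {"text": "first", "_task_hash": 1},
    {"text": "second", "_task_hash": 2},
    {"text": "third", "_task_hash": 3},
]
already_annotated = [1, 3]

remaining = list(filter_seen(stream, already_annotated))
```

Because the filter is a generator, it doesn't load the whole stream into memory, so it should also work with large or infinite streams.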
