hi @joebuckle!
Thanks for your question.
Are you familiar with Prodigy's hashing?
When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are integers, so they can be stored as JSON with each task. Based on those hashes, Prodigy can determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input.
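To make the idea concrete, here's a toy sketch of the two-hash scheme in plain Python. This is *not* Prodigy's actual implementation (Prodigy has its own hashing internals); it just illustrates how an input hash covers only the raw input fields while a task hash also covers the question being asked:

```python
import hashlib
import json

def make_hashes(example, input_keys=("text",), task_keys=("label", "spans")):
    """Toy illustration of input vs. task hashing (not Prodigy's real code).

    The input hash covers only the input fields; the task hash covers
    the input plus the annotation question asked about it.
    """
    input_data = json.dumps({k: example.get(k) for k in input_keys}, sort_keys=True)
    task_data = json.dumps({k: example.get(k) for k in task_keys}, sort_keys=True)
    input_hash = int(hashlib.md5(input_data.encode("utf8")).hexdigest()[:8], 16)
    task_hash = int(hashlib.md5((input_data + task_data).encode("utf8")).hexdigest()[:8], 16)
    return {**example, "_input_hash": input_hash, "_task_hash": task_hash}

a = make_hashes({"text": "hello", "label": "GREETING"})
b = make_hashes({"text": "hello", "label": "FAREWELL"})
# Same input, different question: input hashes match, task hashes differ
assert a["_input_hash"] == b["_input_hash"]
assert a["_task_hash"] != b["_task_hash"]
```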
By default, Prodigy will exclude examples that have the same task hash, which covers the input plus the task itself. If you ran the exact same task (i.e. recipe), in theory Prodigy should skip it. I'm wondering if the task hash is different due to a slight tweak in your custom recipe.
You could try setting `exclude_by` in your configuration to `"input"`. Prodigy will then skip (i.e. deduplicate) examples based on the input hash instead. Since you're using a custom recipe, you can pass this in your recipe's return value under the `"config"` key. That way the setting only applies to your recipe and not to your global config.
```python
return {
    "dataset": dataset,  # the dataset to save annotations to
    "stream": stream,    # the stream of incoming examples
    "config": {
        "exclude_by": "input",  # default value is "task"
    },
}
```
Alternatively, since you have a custom recipe, you can also use the hashes directly to implement whatever logic you want. This may be the more flexible way to make sure the deduplication behaves exactly how you want it to.
If you do this, check out `set_hashes`, which creates hashes for your new batch (stream) of data, as well as `get_input_hashes` and `get_task_hashes`, which retrieve all the hashes that exist for a given dataset.
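The filtering step itself boils down to a small generator. Here's a minimal sketch in plain Python, assuming `stream` is an iterable of already-hashed examples (each carrying an `"_input_hash"` key, as `set_hashes` would add) and `existing_input_hashes` is the set of hashes you pulled from your dataset:

```python
def filter_seen_inputs(stream, existing_input_hashes):
    """Yield only examples whose input hasn't been annotated yet.

    stream: iterable of dicts, each with an "_input_hash" key
    existing_input_hashes: hashes already present in the dataset
    """
    seen = set(existing_input_hashes)
    for eg in stream:
        if eg["_input_hash"] not in seen:
            seen.add(eg["_input_hash"])  # also dedupe within the stream itself
            yield eg

stream = [
    {"text": "a", "_input_hash": 1},
    {"text": "b", "_input_hash": 2},  # already annotated, will be skipped
    {"text": "a", "_input_hash": 1},  # duplicate within the stream, skipped
]
filtered = list(filter_seen_inputs(stream, {2}))
assert [eg["text"] for eg in filtered] == ["a"]
```

You'd plug something like this into your recipe between loading the stream and returning it, with the existing hashes fetched from the database for your target dataset.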
Let me know if this helps or if you have any further questions!