Restarting prodigy on same dataset doesn't skip completed tasks (custom recipe)

Good day,

We created a custom recipe for our use case. We have many audio files to categorize in a session, so we complete a few tasks, then click the save icon to save those annotations. If we restart the server, running Prodigy with the same arguments on the command line, the completed tasks are not skipped. It seems that a new session is created on the same dataset. How do we skip the completed tasks, so we begin on the first one we still need to review/annotate?

Thanks.

hi @joebuckle!

Thanks for your question.

Are you familiar with Prodigy's hashing?

When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are integers, so they can be stored as JSON with each task. Based on those hashes, Prodigy can determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input.
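For example, here's a minimal sketch of what that looks like for a generic text task (the key names follow the defaults; the hashes are just two integers):

    from prodigy import set_hashes

    eg = {"text": "example input", "label": "POSITIVE"}
    eg = set_hashes(eg)  # adds "_input_hash" and "_task_hash" to the task dict
    print(eg["_input_hash"], eg["_task_hash"])  # two integers, stored with the task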

By default, Prodigy will exclude examples that have the same task hash, which equates to the input plus the question asked about it. If you ran the exact same task (i.e., recipe), in theory Prodigy should skip it. I'm wondering if the task hash is different due to a slight tweak in your custom recipe.

You could try setting exclude_by in your configuration to "input". Prodigy will then deduplicate (i.e., skip examples) based on the input hash instead. Since you're using a custom recipe, you can pass this in your recipe's return dictionary under the "config" key. This makes sure the setting applies only to your recipe and not to your global config.

    return {
        "dataset": dataset,          # the dataset to save annotations to
        "stream": stream,            # the stream of incoming examples
        "config": {
            "exclude_by": "input",   # default value is "task"
        },
    }

Alternatively, since you have a custom recipe, you can also use the hashes directly to implement whatever skipping logic you want. This may be the more reliable way to make sure the logic behaves exactly how you want it to.

If you do this, check out set_hashes to create hashes for your new batch (stream) of data, as well as get_input_hashes and get_task_hashes, which retrieve all the hashes that already exist for a given dataset.
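Here's a minimal sketch of that logic, assuming a generator-based stream and the dataset name available inside your recipe (the function name filter_seen is just for illustration):

    from prodigy import set_hashes
    from prodigy.components.db import connect

    def filter_seen(stream, dataset):
        # Collect the task hashes already saved to this dataset
        db = connect()
        seen = set(db.get_task_hashes(dataset))
        for eg in stream:
            # Make sure each incoming example has its hashes set
            eg = set_hashes(eg)
            if eg["_task_hash"] not in seen:
                yield eg

You'd then wrap your stream with stream = filter_seen(stream, dataset) before returning it from your recipe.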

Let me know if this helps or if you have any further questions!

The "exclude_by" didn't work. It seems that we are getting a new session ID every time. Is that the cause of the issue?

We will try set_hashes. Thanks.

To set unique session IDs, have you tried setting your session ID via multi-user sessions (i.e., adding ?session_id=my_session_1 to your URL)? Otherwise, by default, Prodigy creates timestamp-based session IDs (hence why they're different each time).

However, this shouldn't affect either the input hash (which is created from the values of keys like "text", "image", "html", and "input") or the task hash (which concatenates the input hash with a new hash of the values from keys named "spans", "label", "options", and "arcs").

Another possibility: there was an audio user who had issues with their hashing because they accidentally used the key name "paths" (there are several other key names like this on the default ignore list). It's a low chance, but it could definitely explain this weird behavior.
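If that turns out to be the problem, one option is to tell set_hashes explicitly which keys to use instead of relying on the defaults. A sketch, assuming the audio path lives under a key named "paths" (that key name is just the example from the report above):

    from prodigy import set_hashes

    eg = {"paths": "audio/recording_001.wav", "label": "SPEECH"}
    # Explicitly include "paths" in the input hash so it isn't ignored,
    # and overwrite any hashes already set on the task
    eg = set_hashes(eg, input_keys=("paths",), overwrite=True)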