I have created a custom recipe, but I'm still getting repeat examples every time I start a new training session.
From the extract below, shouldn't the filter_tasks function ensure I never get duplicates?
import prodigy
from prodigy.components.db import connect
from prodigy.components.loaders import JSONL
from prodigy.components.filters import filter_tasks
from prodigy.components.sorters import prefer_low_scores


@prodigy.recipe('custom',
    dataset=prodigy.recipe_args['dataset'],
    file_path=("Path to texts", "positional", None, str),
    model_path=("Path to the dummy model", "positional", None, str))
def custom(dataset, file_path, model_path):
    """Annotate the sentiment of texts using different mood options."""
    db = connect()  # uses the prodigy.json settings
    task_hashes = db.get_task_hashes(dataset)
    stream = JSONL(file_path)  # load in the JSONL file
    stream = filter_tasks(stream, task_hashes)
    stream = add_options(stream)  # add options to each task (helper defined elsewhere)
    # Load the dummy model (Classifier is defined elsewhere in this file)
    model = Classifier(model_path=model_path)
    stream = prefer_low_scores(model(stream))
    return {
        'dataset': dataset,    # save annotations in this dataset
        'view_id': 'choice',   # use the choice interface
        'exclude': [dataset],
        'stream': stream,
    }
Hi! The exclude setting returned by your recipe should in theory take care of this, yes. In your case, filter_tasks is not going to make a difference unless the incoming examples from file_path already include hashes – if they don't, there's nothing to compare against. So if you want to handle the filtering in your recipe as well, you also need to assign the hashes yourself:
hashed_stream = (prodigy.set_hashes(eg) for eg in stream)
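To illustrate why the order matters, here's a minimal sketch in plain Python (a simplified stand-in for Prodigy's hashing and filtering, not the real implementation) showing that filtering by task hash only works if the hashes are assigned first:

```python
import hashlib
import json

def set_hashes(task):
    """Assign a deterministic task hash – a simplified stand-in for
    prodigy.set_hashes, hashing the whole (sorted) task dict."""
    payload = json.dumps(task, sort_keys=True).encode("utf8")
    task["_task_hash"] = hashlib.md5(payload).hexdigest()
    return task

def filter_tasks(stream, seen_hashes):
    """Skip tasks whose task hash is already in the dataset (simplified)."""
    for task in stream:
        if task.get("_task_hash") not in seen_hashes:
            yield task

incoming = [{"text": "Great movie!"}, {"text": "Terrible plot."}]
# Hashes already stored from a previous session ("Great movie!" was annotated)
seen = {set_hashes(dict(eg))["_task_hash"] for eg in [{"text": "Great movie!"}]}

# Without hashing first, nothing is filtered: there's no _task_hash to compare
unhashed = list(filter_tasks(iter(incoming), seen))
assert len(unhashed) == 2

# After hashing, the previously annotated example is dropped
hashed = (set_hashes(eg) for eg in incoming)
filtered = list(filter_tasks(hashed, seen))
assert len(filtered) == 1 and filtered[0]["text"] == "Terrible plot."
```

In your recipe, that means assigning the hashes before the filter_tasks call, so the filter has something to compare against the hashes stored in the dataset.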
Also double-check that nothing in your examples is changing between sessions. For example, if you're adding different options before hashing examples, the hashes are going to reflect that. So the same text with different options will receive different task hashes, which means Prodigy will treat those like different questions (which makes sense).
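To see that concretely, here's a quick sketch (again using a simplified stand-in for Prodigy's hashing, just to demonstrate the principle) where the same text with different options produces different task hashes:

```python
import hashlib
import json

def task_hash(task):
    """Simplified stand-in for Prodigy's task hashing: the hash covers
    the whole task dict, including the options."""
    payload = json.dumps(task, sort_keys=True).encode("utf8")
    return hashlib.md5(payload).hexdigest()

text = "The acting was superb."
task_a = {"text": text, "options": [{"id": "pos"}, {"id": "neg"}]}
task_b = {"text": text, "options": [{"id": "happy"}, {"id": "sad"}]}

# Same text, different options -> different task hashes, so these would
# be treated as two different questions
assert task_hash(task_a) != task_hash(task_b)
```

So if your add_options helper doesn't produce the exact same options on every run, the task hashes will differ between sessions and the filtering won't catch the repeats.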
See this thread for a similar use case and more details: