Duplicate tasks when starting a new session

#1

I have created a custom recipe, but I’m still getting repeated examples every time I start a new training session.

Going by the extract below, shouldn’t the filter_tasks function ensure I never get duplicates?

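import prodigy
from prodigy.components.db import connect
from prodigy.components.filters import filter_tasks
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_low_scores

# add_options, Classifier and model_path are assumed to be defined
# elsewhere (they are not shown in this extract)


@prodigy.recipe('custom',
    dataset=prodigy.recipe_args['dataset'],
    file_path=("Path to texts", "positional", None, str))
def custom(dataset, file_path):
    """Annotate the sentiment of texts using different mood options."""
    db = connect()  # uses the prodigy.json settings
    task_hashes = db.get_task_hashes(dataset)  # hashes already saved in this dataset

    stream = JSONL(file_path)     # load in the JSONL file
    stream = filter_tasks(stream, task_hashes)  # intended to drop already-annotated tasks
    stream = add_options(stream)  # add options to each task

    # Load the dummy model
    model = Classifier(model_path=model_path)

    # Score the stream with the model and prefer low-scoring examples
    stream = prefer_low_scores(model(stream))

    return {
        'dataset': dataset,   # save annotations in this dataset
        'view_id': 'choice',  # use the choice interface
        'exclude': [dataset], # skip tasks already present in this dataset
        'stream': stream,
    }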
(Ines Montani) #2

Hi! The exclude setting returned by your recipe should in theory take care of this, yes. In your case, filter_tasks is not going to make a difference unless the incoming examples from file_path already include hashes. If they don’t, there’s nothing to compare against the hashes in task_hashes. So if you want to handle the filtering in your recipe, you also need to assign the hashes yourself before filtering:

hashed_stream = (prodigy.set_hashes(eg) for eg in stream)
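
To illustrate why the order matters, here is a minimal stdlib-only sketch. Note that toy_set_hashes and toy_filter_tasks are hypothetical stand-ins written for this example, not Prodigy's actual implementation; they only mimic the idea of storing a hash under "_task_hash" and skipping examples whose hash has been seen.

```python
import hashlib
import json


def toy_set_hashes(eg):
    """Hypothetical stand-in for prodigy.set_hashes: adds a "_task_hash" key."""
    eg = dict(eg)
    # Hash every key that is not itself a hash/meta key (those start with "_")
    content = {k: v for k, v in eg.items() if not k.startswith("_")}
    payload = json.dumps(content, sort_keys=True)
    eg["_task_hash"] = hashlib.md5(payload.encode("utf8")).hexdigest()
    return eg


def toy_filter_tasks(stream, seen_hashes):
    """Hypothetical stand-in for filter_tasks: skips examples with a seen hash."""
    for eg in stream:
        # An example without a hash can never match, so it always passes through
        if eg.get("_task_hash") not in seen_hashes:
            yield eg


raw = [{"text": "great"}, {"text": "awful"}]
# Pretend "great" was annotated in a previous session
seen = {toy_set_hashes(raw[0])["_task_hash"]}

# Filtering the raw stream removes nothing, because no example has a hash yet
unhashed = list(toy_filter_tasks(raw, seen))
# Hashing first gives the filter something to compare against
hashed = list(toy_filter_tasks((toy_set_hashes(eg) for eg in raw), seen))
```

In the recipe itself, the equivalent change would be to run the set_hashes line over the stream before passing it to filter_tasks.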

Also double-check that nothing in your examples is changing between sessions. For example, if you’re adding different options before hashing examples, the hashes are going to reflect that. So the same text with different options will receive different task hashes, which means Prodigy will treat those like different questions (which makes sense).
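
As a rough illustration of that point, the sketch below hashes the same text with two different sets of options. The toy_hash helper and the example data are made up for this example and only approximate the idea of a hash over the text versus a hash over the text plus options; Prodigy's real hashing may differ in its details.

```python
import hashlib
import json


def toy_hash(eg, keys):
    """Hypothetical helper: hash only the given keys of an example."""
    payload = json.dumps({k: eg[k] for k in keys if k in eg}, sort_keys=True)
    return hashlib.md5(payload.encode("utf8")).hexdigest()


# Same text, but the options added by the recipe changed between sessions
session_1 = {"text": "great", "options": [{"id": "happy", "text": "Happy"}]}
session_2 = {
    "text": "great",
    "options": [{"id": "happy", "text": "Happy"}, {"id": "sad", "text": "Sad"}],
}

# Hashing the text alone gives identical values for both sessions
same_input = toy_hash(session_1, ["text"]) == toy_hash(session_2, ["text"])
# Hashing text plus options gives different values, so the task looks new
same_task = (
    toy_hash(session_1, ["text", "options"])
    == toy_hash(session_2, ["text", "options"])
)
```

Here same_input is True while same_task is False, which is why an example can reappear even though its text was already annotated.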

See this thread for a similar use case and more details: