Resume Annotation Session with Prodigy - Text Classification

Hi! By default, Prodigy tries to make as few assumptions about your existing dataset as possible, but you can tell it to explicitly ignore annotations present in one or more datasets using the --exclude option. For example:

prodigy textcat.teach your_dataset en_core_web_sm data.jsonl --label XXX --exclude your_dataset

This will exclude all examples that were already annotated in the dataset your_dataset (i.e. the current one you’re also saving your annotations to). The exclude mechanism is also useful when you’re creating evaluation sets, to make sure that no training examples accidentally end up in your evaluation data (or vice versa).
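
Conceptually, excluding a dataset amounts to skipping any incoming example whose hash already appears among the saved annotations. Here's a minimal sketch of that idea in plain Python. It's not Prodigy's actual implementation: the example_hash and filter_seen helpers are made up for illustration, and the hash here is simply computed from the "text" and "label" fields.

```python
import hashlib
import json


def example_hash(example):
    # Hypothetical helper: hash the fields that identify an example.
    # Prodigy computes its own hashes; this is only an illustration.
    key = json.dumps(
        {"text": example.get("text"), "label": example.get("label")},
        sort_keys=True,
    )
    return hashlib.sha1(key.encode("utf8")).hexdigest()


def filter_seen(stream, annotated):
    # Collect the hashes of everything that was already annotated ...
    seen = {example_hash(eg) for eg in annotated}
    # ... and only yield incoming examples that aren't in that set.
    for eg in stream:
        if example_hash(eg) not in seen:
            yield eg


# Made-up data: one saved annotation and two incoming examples.
annotated = [{"text": "Great product", "label": "XXX", "answer": "accept"}]
stream = [
    {"text": "Great product", "label": "XXX"},
    {"text": "Terrible service", "label": "XXX"},
]
remaining = list(filter_seen(stream, annotated))
```

With the sample data above, only the "Terrible service" example is left in remaining, since the other one was already annotated.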

If you’re using an active learning-powered recipe like textcat.teach, you’re also training a model in the loop. So if you want to restart with the same model state, you can also pre-train the base model with the existing annotations and then use this model as the starting point. For example:

prodigy textcat.batch-train your_dataset en_core_web_sm --output /path/to/model
prodigy textcat.teach your_dataset /path/to/model --label XXX --exclude your_dataset

If you’re using a custom recipe, you can specify the dataset name(s) to exclude as the 'exclude' setting returned by your recipe. This can be a list of one or more strings:

import prodigy

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset):
    # your recipe here
    return {
        'dataset': dataset,  # dataset to save the annotations to
        'exclude': [dataset],  # always exclude the current set
        # other recipe config here
    }
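
Since 'exclude' is just a list of dataset names, you could also build it from the current dataset plus any others you want to skip, such as an evaluation set. A rough sketch in plain Python follows. The build_exclude helper and the comma-separated extra argument are assumptions for illustration, not part of Prodigy.

```python
def build_exclude(dataset, extra=None):
    # Always exclude the current dataset first.
    names = [dataset]
    # `extra` is a hypothetical comma-separated string, e.g. "eval_set,old_set".
    for name in (extra or "").split(","):
        name = name.strip()
        # Skip empty entries and names that are already in the list.
        if name and name not in names:
            names.append(name)
    return names


exclude = build_exclude("your_dataset", "eval_dataset, your_dataset")
```

In a custom recipe, you could then return 'exclude': build_exclude(dataset, extra) instead of the hard-coded list.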