I am currently using my company’s license for Prodigy to annotate some data (for text classification) that I would later like to analyze with spaCy. As part of the annotation process, is there a way to resume annotation where you left off in your previous session?
(Ex. If I do 200 annotations on Day 1, I would like to resume at 201, and not have to worry about re-annotating data from the first set of 200. I noticed when using the software yesterday, I had to repeat annotating certain data points during my second annotation session.)
Hi! By default, Prodigy tries to make as few assumptions about your existing dataset as possible, but you can tell it to explicitly ignore annotations present in one or more datasets using the --exclude option. For example:
prodigy textcat.teach your_dataset en_core_web_sm data.jsonl --label XXX --exclude your_dataset
This will exclude all examples that were already annotated in the dataset your_dataset (i.e. the current one you’re also saving your annotations to). The exclude mechanism is also useful when you’re creating evaluation sets, to make sure that no training examples accidentally end up in your evaluation data (or vice versa).
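To illustrate the idea behind excluding already-annotated examples, here is a minimal Python sketch that filters an incoming stream against a set of previously annotated examples. It is not Prodigy's actual implementation: the helper names (example_key, filter_seen) are hypothetical, the data is made up, and comparing examples by a hash of their "text" field is an assumption made purely for illustration.

```python
import hashlib
import json


def example_key(example):
    """Hypothetical helper: identify an example by a hash of its text."""
    return hashlib.sha1(example["text"].encode("utf8")).hexdigest()


def filter_seen(stream, annotated):
    """Yield only the examples whose key is not in the annotated set."""
    seen = {example_key(eg) for eg in annotated}
    for eg in stream:
        if example_key(eg) not in seen:
            yield eg


# Examples annotated on day 1 (made-up data for illustration)
annotated = [{"text": "first example", "answer": "accept"}]
# Incoming stream on day 2, one JSON object per line as in a JSONL file
lines = ['{"text": "first example"}', '{"text": "second example"}']
stream = (json.loads(line) for line in lines)

remaining = list(filter_seen(stream, annotated))
print(remaining)
```

The same kind of filter, applied against your training dataset, is what keeps an evaluation set free of training examples.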
If you’re using an active learning-powered recipe like textcat.teach, you’re also training a model in the loop. So if you want to restart with the same model state, you can also pre-train the base model with the existing annotations and then use this model as the starting point. For example:
prodigy textcat.batch-train your_dataset en_core_web_sm --output /path/to/model
prodigy textcat.teach your_dataset /path/to/model --label XXX --exclude your_dataset
If you’re using a custom recipe, you can specify the dataset name(s) to exclude as the 'exclude' setting returned by your recipe. This can be a list of one or more strings:
# your recipe here
'exclude': [dataset], # always exclude the current set
# other recipe config here
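For context, the fragment above might sit in a recipe roughly like the following sketch. To keep it runnable without Prodigy installed, it is written as a plain function; in a real recipe you would register it with the @prodigy.recipe decorator. The function name, the loader, and the choice of the "classification" view are assumptions for illustration; only the 'exclude' entry comes from the answer above.

```python
import json


def load_jsonl_lines(lines):
    """Hypothetical loader: parse one JSON object per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def my_textcat_recipe(dataset, lines):
    """Sketch of a custom recipe; a real one would use @prodigy.recipe."""
    stream = load_jsonl_lines(lines)
    return {
        "dataset": dataset,           # dataset to save annotations to
        "stream": stream,             # iterable of examples to annotate
        "view_id": "classification",  # assumed annotation interface
        "exclude": [dataset],         # always exclude the current set
    }


components = my_textcat_recipe("your_dataset", ['{"text": "hello"}'])
print(components["exclude"])
```

Passing the current dataset name in the list means a restarted session should skip whatever was already saved to it; you could append further dataset names (for example an evaluation set) to the same list.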