Resuming annotation with a model in the loop

Hi all,

I have a query regarding annotations in text classification/NER. It might be an easy one.

Once I close the front-end browser (after saving annotations) and open it again with the same data, or let's just say I can't annotate the whole dataset in one go, the same sentences repeat from the start in the next iteration, with an updated score (i.e. the score differs from the first iteration). Is it normal that all the sentences are repeated with an updated score?

Or should the sentences not repeat in the second iteration? I can understand some of the sentences coming back, since the model is still uncertain about them, but I can't see why ALL of them would.

Thanks

By default, Prodigy makes very few assumptions about your stream of data, so when you exit the Prodigy server (i.e. quit it in your terminal) and start it again, Prodigy will begin at the start of the stream. However, you can tell it to exclude annotations from one or more datasets by using the --exclude argument. So when you start the server again, you won’t be asked about the tasks you’ve already annotated:

prodigy ner.teach your_dataset en_core_web_sm data.jsonl --exclude your_dataset

You can also use multiple datasets, e.g. --exclude set_one,set_two. Excluding sets can also be useful for creating evaluation data, because you definitely want to make sure that your evaluation set doesn’t contain tasks from your training set, and vice versa.
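To illustrate the idea behind excluding a dataset, here is a minimal sketch in plain Python. It is not Prodigy's implementation: the task_hash and exclude_seen helpers are hypothetical, and hashing only the "text" field is an assumption made for the example.

```python
import hashlib


def task_hash(task):
    """Stable hash of a task's text (a stand-in, not Prodigy's own hashing)."""
    return hashlib.sha1(task["text"].encode("utf-8")).hexdigest()


def exclude_seen(stream, annotated):
    """Yield only the tasks whose hash is not among the annotated examples."""
    seen = {task_hash(t) for t in annotated}
    for task in stream:
        if task_hash(task) not in seen:
            yield task


# Incoming stream and one example that was annotated in an earlier session.
stream = [
    {"text": "Apple buys a startup"},
    {"text": "Berlin is a city"},
    {"text": "I like pizza"},
]
annotated = [{"text": "Apple buys a startup", "answer": "accept"}]

remaining = list(exclude_seen(stream, annotated))
```

Only the two sentences that were not annotated before remain in the stream, which is the effect you get from passing --exclude.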

When you re-load the web app (and keep the server running), Prodigy will simply make another request to the /get_questions endpoint and fetch a new batch of tasks from the stream. So you shouldn’t see any duplication here.
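The difference between re-loading the app and restarting the server can be sketched with a plain Python generator. This is a simplified model under stated assumptions, not Prodigy's server code: load_stream and get_questions are hypothetical helpers, with the latter named after the endpoint only for readability.

```python
from itertools import islice


def load_stream(texts):
    """Create a fresh generator over the data, as happens on server start."""
    for text in texts:
        yield {"text": text}


def get_questions(stream, batch_size=2):
    """Take the next batch from the generator the server is holding."""
    return list(islice(stream, batch_size))


texts = ["one", "two", "three", "four"]

stream = load_stream(texts)            # server starts
first_batch = get_questions(stream)    # app loads
after_reload = get_questions(stream)   # browser re-load: same generator, new batch

stream = load_stream(texts)            # server restart: generator is recreated
after_restart = get_questions(stream)  # starts from the beginning again
```

A re-load continues where the generator left off, while a restart recreates it, which is why the same sentences reappear unless they are excluded.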

If you want to stop annotating and start again later, you can always restore the model in the loop by training from the already collected annotations, and then using that pre-trained model in the next annotation session. For example:

prodigy ner.teach your_dataset en_core_web_sm data.jsonl

prodigy ner.batch-train your_dataset en_core_web_sm /output-model
prodigy ner.teach your_dataset /output-model data.jsonl --exclude your_dataset

The /output-model will be trained on the examples collected in the previous session, so it will be very similar to the model you had in the loop before – often much better, though, because the batch-train recipes use multiple iterations and other tricks to improve accuracy, like shuffling the data, setting a dropout rate etc.

Each model you save out with the batch-train recipe will also include two JSONL files containing the training and evaluation data. This means you’ll always be able to reproduce the results, or restore the training data from a previous model (if you’ve made a mistake, want to try adding examples from a different source, etc.).
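If you want to inspect such a JSONL file yourself, a minimal sketch might look like the following. The file name training.jsonl and the record contents are assumptions for illustration (check your model directory for the actual names), and a temporary directory stands in for the saved model directory so the example is self-contained.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def write_jsonl(path, records):
    """Write each record as one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# A temporary directory stands in for a saved model directory.
with tempfile.TemporaryDirectory() as model_dir:
    train_path = Path(model_dir) / "training.jsonl"  # hypothetical file name
    write_jsonl(train_path, [{"text": "Berlin is a city", "answer": "accept"}])
    restored = read_jsonl(train_path)
```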


Hi,

That solves a lot of my doubts.
Yes, I couldn’t use the same model for annotating the dataset later on (after I saved the model to a directory). But I will be able to do that now. :)
Your support rocks!!

Thanks a lot!!
