active learning and update function

Hi! The update callback is called by Prodigy whenever a batch of examples comes back from the web app. It receives a list of annotated examples in Prodigy's JSON format – so basically whatever was sent out via the stream, with the added annotations (e.g. manually added spans) and the "answer". See the API docs here: Custom Recipes · Prodigy
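To make that concrete, here's a minimal sketch of what an update callback could look like for manual NER annotation. The task structure below mirrors Prodigy's JSON format ("text", "spans", "answer"); the filtering logic is just one illustrative way to handle it:

```python
# Sketch of an update callback for manually annotated NER examples.
# The dicts below mirror Prodigy's JSON task format; the filtering
# and span extraction are illustrative, not the only way to do it.

def update(examples):
    """Called by Prodigy with a batch of annotated examples."""
    # Keep only the examples the annotator accepted
    accepted = [eg for eg in examples if eg.get("answer") == "accept"]
    for eg in accepted:
        text = eg["text"]
        # Manually added spans: character offsets plus the label
        spans = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        # ... update your model here, e.g. via nlp.update
    # Whatever you return is passed on to the progress callback
    return len(accepted)

batch = [
    {"text": "Apple is based in Cupertino.",
     "spans": [{"start": 0, "end": 5, "label": "ORG"}],
     "answer": "accept"},
    {"text": "Some rejected example.", "spans": [], "answer": "reject"},
]
print(update(batch))  # number of accepted examples in the batch
```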

Prodigy's binary annotation recipes use a more complex annotation model (e.g. the EntityRecognizer class implemented by Prodigy) to handle updating a spaCy model from binary yes/no annotations. See my slides here for details on why this is slightly more complex.

If you're annotating manually, that shouldn't be necessary – or at least, you should be able to assume that the examples you're getting back are complete and corrected annotations. So you'll be able to just call nlp.update on your spaCy model directly. Pretty much exactly like you would train the model from scratch: https://v2.spacy.io/usage/training#training-simple-style (just with nlp.resume_training instead of begin_training – otherwise you'd be resetting the weights).
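For example, you could convert the accepted examples into the (text, annotations) tuples used in spaCy v2's simple training style, and then feed those to nlp.update. The conversion helper below is plain Python; the update loop in the comments follows the v2 training docs linked above and assumes an NER pipeline:

```python
# Hedged sketch: turn accepted Prodigy examples into spaCy v2's
# "simple training style" tuples, then update the model in place.

def to_train_data(examples):
    """Convert accepted Prodigy examples to (text, annotations) tuples."""
    train_data = []
    for eg in examples:
        if eg.get("answer") != "accept":
            continue
        entities = [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]
        train_data.append((eg["text"], {"entities": entities}))
    return train_data

batch = [{"text": "Apple is based in Cupertino.",
          "spans": [{"start": 0, "end": 5, "label": "ORG"}],
          "answer": "accept"}]
print(to_train_data(batch))

# With a loaded pipeline, e.g. nlp = spacy.load("en_core_web_sm"):
#
#   optimizer = nlp.resume_training()  # NOT begin_training - keep the weights
#   losses = {}
#   for text, annotations in to_train_data(batch):
#       nlp.update([text], [annotations], sgd=optimizer, losses=losses)
#   print(losses)  # e.g. the "ner" loss for this batch
```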

The loss is definitely a good metric to track, and it's also returned by nlp.update. If you want to test the whole end-to-end process, you could also simulate an annotation session: call nlp.update with batches of examples, and then keep evaluating the predictions after multiple updates.
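A toy version of that simulation might look like this – note that `fake` loss here is just a stand-in for what nlp.update would report, so you can see the shape of the loop (batch, update, record loss, evaluate) without a model:

```python
# Toy simulation of an annotation session: feed batches of examples to an
# update step and track the loss over time. The loss value here is a
# stand-in for what nlp.update would report - it just decays with noise.

import random

def minibatch(items, size):
    """Yield successive batches of `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def simulate_session(examples, batch_size=10, seed=0):
    rng = random.Random(seed)
    losses = []
    for step, batch in enumerate(minibatch(examples, batch_size)):
        # Stand-in for: nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        loss = len(batch) / (step + 1) * (0.8 + 0.4 * rng.random())
        losses.append(loss)
        # ... evaluate predictions on a held-out set here every few steps
    return losses

print(simulate_session(list(range(50)), batch_size=10))
```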

For the active learning recipes, Prodigy uses the loss to estimate the progress – basically to give you a rough idea of when to stop annotating (when the loss approaches zero and there's nothing left to learn). You could also implement your own progress callback that receives whatever is returned from your update callback (e.g. the loss) and returns a progress value. This could be based on how many examples are left, or a combination of that and the loss over time.
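The core logic of such a callback could look like the sketch below. Note that Prodigy's exact progress callback signature may differ between versions, so this just shows the estimation idea: mix the fraction of examples annotated with the recent loss trend into a single 0–1 value. The 50/50 weighting is an arbitrary choice for the sketch:

```python
# Hedged sketch of a custom progress estimate: combine how much of the
# stream has been annotated with how much the recent loss has decayed.
# The 0.5/0.5 weighting is arbitrary - tune it for your own workflow.

def estimate_progress(done, total, losses, window=5):
    """Return a progress value in [0, 1]."""
    frac_done = done / total if total else 0.0
    recent = losses[-window:]
    if len(recent) >= 2 and recent[0] > 0:
        # 1.0 when the recent loss has decayed to zero, 0.0 when flat
        loss_progress = max(0.0, 1.0 - recent[-1] / recent[0])
    else:
        loss_progress = 0.0
    return min(1.0, 0.5 * frac_done + 0.5 * loss_progress)

# 200 of 1000 examples done, loss dropping from 9.0 to 1.0 recently
print(estimate_progress(200, 1000, [9.0, 6.0, 4.0, 2.0, 1.0]))
```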

It kinda depends – this is the batch size used to update the model, so you typically want a good trade-off: batches large enough to give stable, meaningful updates, but small enough that the model can update quickly between batches of annotations. We've specifically optimised spaCy to be updatable with small batches to allow workflows like this, so the default batch size of 10 should work okay – but if you're working with other implementations or newer transformer-based pipelines, you might want to experiment with larger batch sizes.