Generic workflow for active learning on non-NLP tasks with a custom or scikit-learn model


I was hoping to get some guidance on using Prodigy and active learning to train a scikit-learn or other custom model for possibly non-NLP tasks. For instance, based on multiple R, G, B values in an image, train a model to identify human-friendly color names such as orange, cyan, purple, etc.

The hope really is to find a generic usage pattern for integrating my custom learner with Prodigy's active-learning-based annotator.

Thanks for your help!

Hi Atul.

There are two main ways to leverage a model in Prodigy.

Offline Active Learning

Use the model to generate predictions on your unlabelled data up front, so that you can sort or select a subset of interest before passing the data into Prodigy. This can be done from a Jupyter notebook. I guess you might call this the "offline" approach because it all happens before annotating.
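To make the offline idea a bit more concrete, here's a rough sketch using the RGB colour example from the question. Everything in it is an assumption for illustration: the seed labels, the random "unlabelled" pool, the choice of LogisticRegression, and the layout of the `tasks` dicts (it is not meant to be Prodigy's required task format). It only shows the "score first, then pick the uncertain ones" step in plain scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical seed set: a few RGB values (scaled to 0-1) with colour names.
X_seed = np.array([
    [1.0, 0.5, 0.0], [1.0, 0.6, 0.1],   # orange
    [0.0, 1.0, 1.0], [0.1, 0.9, 0.9],   # cyan
    [0.5, 0.0, 0.5], [0.6, 0.1, 0.6],   # purple
])
y_seed = np.array(["orange", "orange", "cyan", "cyan", "purple", "purple"])

# Hypothetical unlabelled pool: random RGB values standing in for real data.
X_pool = rng.random((200, 3))

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

# Least-confidence score: 1 minus the probability of the most likely class.
proba = model.predict_proba(X_pool)
uncertainty = 1.0 - proba.max(axis=1)

# Most uncertain examples first.
order = np.argsort(-uncertainty)

# Keep the top 20 as plain dicts; the key names here are made up.
tasks = [
    {
        "rgb": X_pool[i].round(3).tolist(),
        "label": str(model.classes_[proba[i].argmax()]),
        "score": float(uncertainty[i]),
    }
    for i in order[:20]
]
```

You could write `tasks` out to a file and annotate that subset instead of the full pool. Least-confidence is just one possible selection rule; sorting by a specific class probability would work the same way.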

Online Active Learning

Have the model update while annotations come in. In this case the model would learn from the stream of annotations that come in before making predictions on the next batch of data.

Figured I might make a diagram of this.

Here's a conceptual overview of what happens when Prodigy runs an annotation interface with online active learning enabled.

First, Prodigy produces a batch of data to annotate. These might be examples that the model is relatively uncertain about, but you may also try to select a specific class. Once the batch is ready it will have model predictions attached.

Then, the user adds annotations.

The user may correct the predictions of the model but may also choose to keep them. These annotations are stored in the database.

But in this online scenario they will also be used to update the model!

Then, when the model is updated, Prodigy can fetch a new batch of data and the process repeats.

About scikit-learn.

If you're interested in using scikit-learn then I would recommend trying offline active learning first. Mostly because it's easier to set up, but also because it's an easier integration with scikit-learn. Not every scikit-learn model allows for online learning. Some estimators support the .partial_fit method, but it's a small subset. Here's a tutorial that dives into more detail on this topic:

If you'd really like to explore an online learning system for scikit-learn, you might like to know that I have written some helpers. This project may be of help:
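Separately from that tutorial, a quick way to see whether a given estimator can learn incrementally is to check for a `partial_fit` method. The four estimators below are just examples I picked for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB

candidates = [
    RandomForestClassifier(),
    LogisticRegression(),
    SGDClassifier(),
    GaussianNB(),
]

# True means the estimator can be updated batch by batch.
supports_online = {
    type(est).__name__: hasattr(est, "partial_fit") for est in candidates
}

for name, ok in supports_online.items():
    print(f"{name}: {'partial_fit available' if ok else 'needs a full refit'}")
```

Of these four, only `SGDClassifier` and `GaussianNB` expose `partial_fit`; the other two have to be refit on all the data.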

Another option

If you're dealing with smaller datasets, you might also consider retraining your model from scratch on every batch. This won't scale, but it is a lot easier to implement. I have a demo project of this in this repo if you're interested in doing that:
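Independent of that repo, the retrain-every-batch idea might look roughly like the sketch below. It is not the demo project's code. The annotator function, the random RGB pool, and the choice of `RandomForestClassifier` (an estimator without `partial_fit`) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

CLASSES = np.array(["cyan", "orange", "purple"])
REFS = np.array([[0.0, 1.0, 1.0], [1.0, 0.5, 0.0], [0.5, 0.0, 0.5]])

def simulated_annotator(rgb):
    """Hypothetical stand-in for the human: name of the nearest reference colour."""
    return str(CLASSES[np.argmin(((REFS - rgb) ** 2).sum(axis=1))])

rng = np.random.default_rng(0)
pool = rng.random((300, 3))  # hypothetical unlabelled RGB values

# Start with a small labelled set.
X_labelled = pool[:15]
y_labelled = [simulated_annotator(x) for x in X_labelled]
pool = pool[15:]

for _ in range(4):
    # Refit from scratch on everything labelled so far.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_labelled, y_labelled)

    # Pick the 10 pool examples with the lowest top-class probability.
    proba = model.predict_proba(pool)
    picked = np.argsort(proba.max(axis=1))[:10]
    batch = pool[picked]
    answers = [simulated_annotator(x) for x in batch]

    # Grow the labelled set and shrink the pool.
    X_labelled = np.vstack([X_labelled, batch])
    y_labelled.extend(answers)
    pool = np.delete(pool, picked, axis=0)
```

The cost of each round grows with the labelled set, which is why this only makes sense while the data stays small.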

Does this help?
