I was hoping to get some guidance on using Prodigy and active learning to train a scikit-learn or other custom model, possibly for non-NLP tasks. For instance: based on the R, G, B values of pixels in an image, train a model to identify human-friendly colors like orange, cyan, purple, etc.
The hope, really, is to find a generic usage pattern for integrating my custom learner into Prodigy's active-learning-based annotator.
There are two main ways to leverage a model in Prodigy.
Offline Active Learning
Use the model to generate predictions on your unlabelled data up front, so that you can sort/select a subset of interest before passing the data into Prodigy. This can be done from a Jupyter notebook. You might call this the "offline" approach because it happens before annotating.
Online Active Learning
Have the model update while annotations come in. In this case the model learns from the stream of incoming annotations before making predictions on the next batch of data.
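To make the offline approach concrete, here's a minimal scikit-learn sketch for the color example from the question. The data, the two color classes, and the margin-based uncertainty score are all illustrative assumptions, not anything Prodigy-specific: you train on a small labeled seed set, score an unlabelled pool, and keep the examples the model is least sure about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: RGB triples with a few seed labels
# (0 = "orange-ish", 1 = "cyan-ish"); names and values are illustrative.
X_seed = np.array([[255, 140, 0], [255, 160, 20], [0, 255, 255], [10, 230, 240]])
y_seed = np.array([0, 0, 1, 1])

rng = np.random.default_rng(0)
X_pool = rng.integers(0, 256, size=(1000, 3))  # unlabelled RGB pool

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

# Uncertainty for the binary case: distance of the class-0 probability
# from 0.5 -- a smaller margin means the model is less sure.
proba = model.predict_proba(X_pool)
margin = np.abs(proba[:, 0] - 0.5)

# The 100 most uncertain examples: annotate these in Prodigy first.
most_uncertain = X_pool[np.argsort(margin)[:100]]
```

You'd then export `most_uncertain` to JSONL and feed it to Prodigy as a regular input file.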
Figured I might make a diagram of this.
Here's a conceptual overview of what happens when Prodigy runs an annotation interface with online active learning enabled.
First, Prodigy produces a batch of data to annotate. These might be examples the model is relatively uncertain about, but you may also select for a specific class. Once the batch is ready, it will have model predictions attached.
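As a generic sketch of that step (not Prodigy's actual internals), a stream is just a generator of task dicts, and a hypothetical `score_fn` stands in for your model, attaching a prediction to each example before it reaches the annotator:

```python
# Generic sketch, no Prodigy-specific API: a stream of annotation tasks
# as dicts, with a hypothetical score_fn attaching a model prediction
# to each example before it is sent to the annotator.
def scored_stream(examples, score_fn):
    for eg in examples:
        score, label = score_fn(eg)
        eg["meta"] = {"score": score}  # e.g. shown on the annotation card
        eg["label"] = label            # pre-filled prediction to accept/reject
        yield eg

examples = [{"rgb": [255, 140, 0]}, {"rgb": [0, 255, 255]}]

def score_fn(eg):
    # Stand-in for model.predict_proba on eg["rgb"]; the rule and the
    # label names are made up for illustration.
    r, g, b = eg["rgb"]
    return (0.9 if r > g else 0.6), ("ORANGE" if r > g else "CYAN")

tasks = list(scored_stream(examples, score_fn))
```

A sorter can then consume this stream and prioritise the low-score (uncertain) tasks.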
If you're interested in using scikit-learn, I would recommend trying offline active learning first. Mostly because it's easier to set up, but also because it's an easier integration with scikit-learn: not every scikit-learn model allows for online learning. Some estimators support the .partial_fit mechanic, but it's a small subset. Here's a tutorial that dives into more detail on this topic:
If you'd really like to explore an online learning system for scikit-learn, you might like to know that I have written some helpers. This project may be of help:
Another option
If you're dealing with smaller datasets, you might also consider retraining your model on every batch. This won't scale, but it is much easier to implement. I have a demo of this in a repo if you're interested in that approach:
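The retrain-on-every-batch idea fits in a few lines. This sketch (the `on_batch` helper and the seed examples are made up for illustration) just accumulates all annotations seen so far and refits from scratch each time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Seed with one example per class so every refit sees both classes;
# the RGB values and labels are illustrative.
X_all = [[255, 140, 0], [0, 255, 255]]
y_all = [0, 1]
model = None

def on_batch(batch_X, batch_y):
    """Append new annotations and refit on everything collected so far."""
    global model
    X_all.extend(batch_X.tolist())
    y_all.extend(batch_y.tolist())
    model = LogisticRegression(max_iter=1000).fit(np.array(X_all), np.array(y_all))

# Pretend three annotation batches arrive.
rng = np.random.default_rng(0)
for _ in range(3):
    X = rng.integers(0, 256, size=(8, 3))
    y = rng.integers(0, 2, size=8)
    on_batch(X, y)
```

Refitting is wasteful but keeps the loop dead simple, and for a few thousand examples a LogisticRegression refit is nearly instant.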