I was hoping to get some guidance on using Prodigy and active learning to train a scikit-learn or other custom model, possibly for non-NLP tasks. For instance: based on the R, G, B values of pixels in an image, train a model to identify human-friendly colors like orange, cyan, purple, etc.
The hope, really, is to find a generic usage pattern for integrating my custom learner into Prodigy's active-learning-based annotator.
There are two main ways to leverage a model in Prodigy.
Offline Active Learning
Use the model to generate predictions on your unlabelled data up front, so that you can sort/select a subset of interest before passing the data into Prodigy. This can be done from a Jupyter notebook. You might call this the "offline" approach because it happens before annotating.
Online Active Learning
Have the model update while annotations come in. In this case the model learns from the stream of incoming annotations before making predictions on the next batch of data.
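To make the offline approach concrete, here's a minimal scikit-learn sketch for the color example from the question. The data, the two color classes, and the margin-based uncertainty score are all illustrative assumptions, not anything Prodigy-specific: you train on a small labeled seed set, score an unlabelled pool, and keep the examples the model is least sure about.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: RGB triples with a few seed labels
# (0 = "orange-ish", 1 = "cyan-ish"); names and values are illustrative.
X_seed = np.array([[255, 140, 0], [255, 160, 20], [0, 255, 255], [10, 230, 240]])
y_seed = np.array([0, 0, 1, 1])

rng = np.random.default_rng(0)
X_pool = rng.integers(0, 256, size=(1000, 3))  # unlabelled RGB pool

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)

# Uncertainty for the binary case: distance of the class-0 probability
# from 0.5 -- a smaller margin means the model is less sure.
proba = model.predict_proba(X_pool)
margin = np.abs(proba[:, 0] - 0.5)

# The 100 most uncertain examples: annotate these in Prodigy first.
most_uncertain = X_pool[np.argsort(margin)[:100]]
```

You'd then export `most_uncertain` to JSONL and feed it to Prodigy as a regular input file.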
Figured I might make a diagram of this.
Here's a conceptual overview of what happens when Prodigy runs an annotation interface with online active learning enabled.
First, Prodigy produces a batch of data to annotate. These might be examples the model is relatively uncertain about, but you may also select for a specific class. Once the batch is ready, it will have model predictions attached.
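As a generic sketch of that step (not Prodigy's actual internals), a stream is just a generator of task dicts, and a hypothetical `score_fn` stands in for your model, attaching a prediction to each example before it reaches the annotator:

```python
# Generic sketch, no Prodigy-specific API: a stream of annotation tasks
# as dicts, with a hypothetical score_fn attaching a model prediction
# to each example before it is sent to the annotator.
def scored_stream(examples, score_fn):
    for eg in examples:
        score, label = score_fn(eg)
        eg["meta"] = {"score": score}  # e.g. shown on the annotation card
        eg["label"] = label            # pre-filled prediction to accept/reject
        yield eg

examples = [{"rgb": [255, 140, 0]}, {"rgb": [0, 255, 255]}]

def score_fn(eg):
    # Stand-in for model.predict_proba on eg["rgb"]; the rule and the
    # label names are made up for illustration.
    r, g, b = eg["rgb"]
    return (0.9 if r > g else 0.6), ("ORANGE" if r > g else "CYAN")

tasks = list(scored_stream(examples, score_fn))
```

A sorter can then consume this stream and prioritise the low-score (uncertain) tasks.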
If you're interested in using scikit-learn, I would recommend trying offline active learning first. Mostly because it's easier to set up, but also because it's an easier integration with scikit-learn: not every scikit-learn model allows for online learning. Some estimators support the .partial_fit mechanic, but it's a small subset. Here's a tutorial that dives into more detail on this topic:
If you'd really like to explore an online learning system for scikit-learn, you might like to know that I have written some helpers. This project may be of help:
Another option
If you're dealing with smaller datasets, you might also consider retraining your model on every batch. This won't scale, but it is much easier to implement. I have a demo of this in a repo if you're interested in that approach:
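The retrain-on-every-batch idea fits in a few lines. This sketch (the `on_batch` helper and the seed examples are made up for illustration) just accumulates all annotations seen so far and refits from scratch each time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Seed with one example per class so every refit sees both classes;
# the RGB values and labels are illustrative.
X_all = [[255, 140, 0], [0, 255, 255]]
y_all = [0, 1]
model = None

def on_batch(batch_X, batch_y):
    """Append new annotations and refit on everything collected so far."""
    global model
    X_all.extend(batch_X.tolist())
    y_all.extend(batch_y.tolist())
    model = LogisticRegression(max_iter=1000).fit(np.array(X_all), np.array(y_all))

# Pretend three annotation batches arrive.
rng = np.random.default_rng(0)
for _ in range(3):
    X = rng.integers(0, 256, size=(8, 3))
    y = rng.integers(0, 2, size=8)
    on_batch(X, y)
```

Refitting is wasteful but keeps the loop dead simple, and for a few thousand examples a LogisticRegression refit is nearly instant.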