Including own active learning function / active learning outputs

Dear Prodigy,

Two questions:

  1. I would like to add my own active learning function and/or an external third-party active learning library. Is it possible to include those within Prodigy instead of Prodigy’s built-in active learning? If yes, could you provide some guidance on how to achieve this?

  2. Providing a wrapper function should do the trick, but I am having difficulty figuring out what the built-in active learning function outputs for the returned ‘stream’ variable.

Your assistance would be much appreciated. Thank you.

Yes, that’s definitely possible :slightly_smiling_face: You can do this by writing a custom recipe that returns the incoming stream of examples filtered by your model, and an update callback that updates your model with incoming answers.

You might find this example useful, which shows how to plug a custom model into the loop, using a simple dummy model:

There are basically 4 components here (see also the sketch after this list):

  • A generator that yields out annotation examples, e.g. from a file.
  • A function that takes a stream of examples, gets a score for each example from the model and yields (score, example) tuples.
  • A sorter function that takes a stream of (score, example) tuples and yields out examples. That’s where part of the active learning happens: based on the score, you can apply whatever metric you like to decide whether to send out an example or not. The built-in prefer_uncertain sorter uses an exponential moving average to track the scores and will prefer scores closest to 0.5 – but you can also implement your own logic that handles this differently (high scores in certain conditions, low scores in others and so on).
  • An update function that takes a list of answers (the original examples with an "answer" key that’s either "accept", "reject" or "ignore"). Based on that, you can then update your model accordingly. If you make your update function return the loss, that value will be used to calculate the progress bar, which is a rough estimate of when the loss will hit 0.
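Putting those four pieces together, a minimal custom recipe might look roughly like this. It’s only a sketch, not the code from the linked example: the dummy model, the recipe name and the source handling are placeholders, and the exact import paths and recipe arguments may differ slightly between Prodigy versions.

import random

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain


class DummyModel:
    # stand-in for a real model: random scores, no-op update
    def __call__(self, stream):
        for eg in stream:
            yield (random.random(), eg)  # (score, example) tuples

    def update(self, answers):
        # answers are the annotated examples, each with an "answer" key
        # update the real model here and optionally return a loss
        return 0.0


@prodigy.recipe("custom-textcat")
def custom_textcat(dataset, source):
    model = DummyModel()
    stream = JSONL(source)                    # 1. generator yielding raw examples
    stream = prefer_uncertain(model(stream))  # 2. + 3. score the stream and sort it
    return {
        "dataset": dataset,          # dataset the annotations are saved to
        "stream": stream,            # examples sent out for annotation
        "update": model.update,      # 4. called with batches of answered examples
        "view_id": "classification",
    }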

Ines,

Many thanks for your comprehensive reply. I believe many newcomers will find it very useful.

I was able to complete all the steps you have outlined. At the moment, I am more interested in applying my own active learning function. For example, this line (as presented in the GitHub source, line 64):

stream = prefer_uncertain(model(stream))

would become:

stream = my_own_function(model(stream))

Is it possible?

I believe there should be no trouble; however, I cannot figure out what prefer_uncertain() outputs, so that my own function can produce the same kind of output for the stream. As before, your help is much appreciated.

Thanks, glad it was helpful!

Yes, you can replace that with your own function that takes a stream of (score, example) tuples and yields examples. Here’s a super basic example:

def my_own_function(scored_stream):
    # scored_stream yields (score, example) tuples produced by the model
    for score, example in scored_stream:
        if score >= 0.5:  # only send out examples the model scores at 0.5 or higher
            yield example

Based on the score (and any other state), your function can then decide whether to send out an example or not. In a real-world scenario, you probably also want to use an exponential moving average, or implement some other logic to make sure you never get stuck. The above code could potentially end up looping forever if the model ends up in a state where it always predicts low scores.
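For example, a sketch of a variant that keeps a simple moving average of the scores might look like this. The 0.5 starting value and the 0.9 smoothing factor are arbitrary choices for illustration, not what the built-in prefer_uncertain uses:

def my_own_function(scored_stream, smoothing=0.9):
    avg = 0.5  # running estimate of the scores seen so far
    for score, example in scored_stream:
        avg = smoothing * avg + (1 - smoothing) * score
        # send out examples scoring at or above the moving average, so the
        # threshold adapts if the model drifts towards consistently low scores
        if score >= avg:
            yield example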


Many thanks for your replies. The approach seems clear.

For future Prodigy versions, what I would find helpful is if the sampling function (batch selection) could be wrapped up within the custom active learning function, instead of being fed directly to the model.

At the moment, the custom function only applies a threshold, but is not responsible for batching.

Some of the external AL libraries combine both (batching and the AL algorithm) into a single function (AL_method, batch_size), something like the sketch below.
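For instance, something along these lines is what I have in mind. The pool size and the uncertainty sort are only illustrative, not taken from any particular library:

def my_own_function(scored_stream, batch_size=10):
    # collect a pool of scored candidates, then let the AL strategy
    # pick the whole batch instead of filtering examples one by one
    pool = []
    for score, example in scored_stream:
        pool.append((score, example))
        if len(pool) >= batch_size * 5:  # arbitrary candidate pool size
            pool.sort(key=lambda item: abs(item[0] - 0.5))  # most uncertain first
            for _, best in pool[:batch_size]:
                yield best
            pool = []  # discard the rest of the pool for simplicity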

Many thanks for your hard work.

Just to make sure I understand your question correctly: You mean, instead of specifying the batch_size in the recipe config or global config, the stream generator should be in charge of batching and batch sizing?

I was rather thinking of operating on a more local level, within my_own_function, so that the complete batching and processing would happen inside the custom AL function.

Well, there are many ways to skin a cat. Many thanks for all your support.