Including own active learning function / active learning outputs

Dear Prodigy,

Two questions:

  1. I would like to add my own active learning function and/or an external third-party active learning library. Is it possible to include those within Prodigy instead of Prodigy’s built-in active learning? If yes, could you provide some guidance on how to achieve this?

  2. Providing a wrapper function should do the trick, but I am having difficulty figuring out what the built-in active learning function outputs for the returned ‘stream’ variable.

Your assistance would be much appreciated. Thank you.

Yes, that’s definitely possible :slightly_smiling_face: You can do this by writing a custom recipe that returns the incoming stream of examples filtered by your model, and an update callback that updates your model with incoming answers.

You might find this example useful, which shows how to plug a custom model into the loop, using a simple dummy model:

There are basically 4 components here (see also the sketch after this list):

  • A generator that yields out annotation examples, e.g. from a file.
  • A function that takes a stream of examples, gets a score for each example from the model and yields (score, example) tuples.
  • A sorter function that takes a stream of (score, example) tuples and yields out examples. That’s where part of the active learning happens: based on the score, you can apply whatever metric you like to decide whether to send out an example or not. The built-in prefer_uncertain sorter uses an exponential moving average to track the scores and will prefer scores closest to 0.5 – but you can also implement your own logic that handles this differently (high scores in certain conditions, low scores in others and so on).
  • An update function that takes a list of answers (the original examples with an "answer" key that’s either "accept", "reject" or "ignore"). Based on that, you can then update your model accordingly. If you make your update function return the loss, that value will be used to calculate the progress bar, which is a rough estimate of when the loss will hit 0.
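Putting those four pieces together, a minimal custom recipe might look roughly like this. It’s only a sketch, not the code from the linked example: the dummy model, the recipe name and the source handling are placeholders, and the exact import paths and recipe arguments may differ slightly between Prodigy versions.

import random

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain


class DummyModel:
    # stand-in for a real model: random scores, no-op update
    def __call__(self, stream):
        for eg in stream:
            yield (random.random(), eg)  # (score, example) tuples

    def update(self, answers):
        # answers are the annotated examples, each with an "answer" key
        # update the real model here and optionally return a loss
        return 0.0


@prodigy.recipe("custom-textcat")
def custom_textcat(dataset, source):
    model = DummyModel()
    stream = JSONL(source)                    # 1. generator yielding raw examples
    stream = prefer_uncertain(model(stream))  # 2. + 3. score the stream and sort it
    return {
        "dataset": dataset,          # dataset the annotations are saved to
        "stream": stream,            # examples sent out for annotation
        "update": model.update,      # 4. called with batches of answered examples
        "view_id": "classification",
    }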

Ines,

Many thanks for your comprehensive reply. I believe many newcomers will find it very useful.

I was able to complete all the steps you have outlined. At the moment, I am more interested in applying my own active learning function. For example, this line (as presented in the GitHub source, line 64):

stream = prefer_uncertain(model(stream))

would become:

stream = my_own_function(model(stream))

Is it possible?

I believe there should be no trouble; however, I cannot figure out what prefer_uncertain() outputs, so that my own function can produce the same kind of output for the stream. As before, your help is much appreciated.

Thanks, glad it was helpful!

Yes, you can replace that with your own function that takes a stream of (score, example) tuples and yields examples. Here’s a super basic example:

def my_own_function(scored_stream):
    # scored_stream yields (score, example) tuples produced by the model
    for score, example in scored_stream:
        if score >= 0.5:  # only send out examples the model scores at 0.5 or higher
            yield example

Based on the score (and any other state), your function can then decide whether to send out an example or not. In a real-world scenario, you probably also want to use an exponential moving average, or implement some other logic to make sure you never get stuck. The above code could potentially end up looping forever if the model ends up in a state where it always predicts low scores.
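For example, a sketch of a variant that keeps a simple moving average of the scores might look like this. The 0.5 starting value and the 0.9 smoothing factor are arbitrary choices for illustration, not what the built-in prefer_uncertain uses:

def my_own_function(scored_stream, smoothing=0.9):
    avg = 0.5  # running estimate of the scores seen so far
    for score, example in scored_stream:
        avg = smoothing * avg + (1 - smoothing) * score
        # send out examples scoring at or above the moving average, so the
        # threshold adapts if the model drifts towards consistently low scores
        if score >= avg:
            yield example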


Many thanks for your replies. The approach seems clear.

For future Prodigy versions, what I would find helpful is if the sampling function (batch selection) could be wrapped up within the custom active learning function, instead of being fed directly to the model.

At the moment, the custom function only applies a threshold, but is not responsible for batching.

Some of the external AL libraries combine both (batching and the AL algorithm) into a single function (AL_method, batch_size), something like the sketch below.
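For instance, something along these lines is what I have in mind. The pool size and the uncertainty sort are only illustrative, not taken from any particular library:

def my_own_function(scored_stream, batch_size=10):
    # collect a pool of scored candidates, then let the AL strategy
    # pick the whole batch instead of filtering examples one by one
    pool = []
    for score, example in scored_stream:
        pool.append((score, example))
        if len(pool) >= batch_size * 5:  # arbitrary candidate pool size
            pool.sort(key=lambda item: abs(item[0] - 0.5))  # most uncertain first
            for _, best in pool[:batch_size]:
                yield best
            pool = []  # discard the rest of the pool for simplicity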

Many thanks for your hard work.

Just to make sure I understand your question correctly: You mean, instead of specifying the batch_size in the recipe config or global config, the stream generator should be in charge of batching and batch sizing?

I was rather thinking of operating on a more local level, within my_own_function, so that the complete batching and processing would happen inside the custom AL function.

Well, there are many ways to skin a cat. Many thanks for all your support.