Active Learning Methodology

I was recently accepted to the Prodigy beta and am looking forward to testing it out. When reading through the documentation, I was not able to find information on how the active learning is actually done. What methodology are you using to determine whether a specific instance needs to be labeled? Is there flexibility to change the active learning selection algorithm or the parameters associated with the in-built algorithm?

It seems a custom algorithm could be added in a similar way as the filter_tweets method in the example recipe for classifying tweets, but I am not sure how an implementation there would interact with the in-built algorithm.

Please let me know if you have any additional information about how the active learning works inside Prodigy. Thanks for putting together such a useful tool!


In the example recipes, the model is a generator function that takes example dicts as input and yields (score, example) tuples. For instance:

def dummy_model(tasks):
    for example in tasks:
        yield 0.5, example

The active learning is performed by reranking this generator, based on the scores. A few functions are provided in the prodigy.components.sorters module. The one used by default is prefer_uncertain.
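To make the idea of reranking by score concrete, here is a minimal sketch of an uncertainty-first ordering. It is not Prodigy's implementation of prefer_uncertain; the names uncertainty and most_uncertain_first are hypothetical, and it assumes scores are probabilities in [0, 1], so that 0.5 is the least certain prediction. It also sorts the whole input at once, so it only suits a finite list, not an endless stream.

```python
def uncertainty(score):
    """Map a probability to an uncertainty: 1.0 at score 0.5, 0.0 at 0.0 or 1.0."""
    return 1.0 - abs(score - 0.5) * 2.0


def most_uncertain_first(scored):
    """Hypothetical sorter: take (score, example) tuples, yield examples.

    Assumption: `scored` is finite, because it is sorted in one go.
    """
    ranked = sorted(scored, key=lambda pair: uncertainty(pair[0]), reverse=True)
    for _score, example in ranked:
        yield example


scored_examples = [
    (0.95, {"text": "a"}),
    (0.52, {"text": "b"}),
    (0.10, {"text": "c"}),
    (0.40, {"text": "d"}),
]
ranked_texts = [eg["text"] for eg in most_uncertain_first(scored_examples)]
```

Here the example scored 0.52 comes out first and the one scored 0.95 comes out last, since the latter is the prediction the model is most sure about.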

You can write your own active learning function. It should take a sequence of (score, example) tuples, and yield example objects. You might want to batch the sequence to sort it, like so:

import cytoolz

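def example_sorter(scored, batch_size=128):
    for batch in cytoolz.partition_all(batch_size, scored):
        # Prefer low scores within the batch. Sort on the score only, so
        # tied scores never fall back to comparing the example dicts.
        batch = sorted(batch, key=lambda pair: pair[0])
        for score, example in batch:
            yield example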
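A sorter like this can be tried outside Prodigy on a small in-memory stream. The sketch below uses only the standard library (itertools.islice in place of cytoolz) so it runs without extra dependencies; toy_model and low_score_first are hypothetical names, and the length-based scoring is just a stand-in for a real model.

```python
from itertools import islice


def toy_model(tasks):
    """Hypothetical stand-in model: longer texts get higher scores."""
    for example in tasks:
        yield min(len(example["text"]) / 10.0, 1.0), example


def low_score_first(scored, batch_size=2):
    """Yield examples batch by batch, lowest score first within each batch."""
    scored = iter(scored)
    while True:
        # Pull at most batch_size items, so the input stream stays lazy.
        batch = list(islice(scored, batch_size))
        if not batch:
            return
        for _score, example in sorted(batch, key=lambda pair: pair[0]):
            yield example


stream = [{"text": "aaaa"}, {"text": "a"}, {"text": "aaa"}, {"text": "aa"}]
texts = [eg["text"] for eg in low_score_first(toy_model(stream))]
```

With a batch size of 2, the ordering is only local: "a" precedes "aaaa" in the first batch and "aa" precedes "aaa" in the second, but the output is not globally sorted. That is the trade-off of batching: a larger batch_size gives a better ranking, at the cost of scoring more examples before the first one is shown.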

In the recipe function, you return the sorted examples under the 'stream' key. So you could do something like:

return {
    'dataset': dataset,
    'stream': example_sorter(dummy_model(stream)),
    'update': create_update_callback(dummy_model),
}