I was recently accepted to the Prodigy beta and am looking forward to testing it out. When reading through the documentation, I was not able to find information on how the active learning is actually done. What methodology are you using to determine whether a specific instance needs to be labeled? Is there flexibility to change the active learning selection algorithm or the parameters associated with the in-built algorithm?
It seems a custom algorithm could be added in a similar way to the filter_tweets method in the example recipe for classifying tweets, but I am not sure how such an implementation would interact with the in-built algorithm.
Please let me know if you have any additional information about how the active learning works inside Prodigy. Thanks for putting together such a useful tool!
In the example recipes, the model is a generator function that takes a stream of example dicts as input and yields (score, example) tuples. For instance:
    def dummy_model(tasks):
        for example in tasks:
            yield 0.5, example
The active learning is performed by reranking this generator, based on the scores. A few functions are provided in the prodigy.components.sorters module. The one used by default is prefer_uncertain.
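To make the reranking idea concrete, here is a minimal sketch of an uncertainty-based sorter. It is not Prodigy's own prefer_uncertain implementation: the name sketch_prefer_uncertain, the batch_size parameter, and the "distance from 0.5" measure of uncertainty are all assumptions made for illustration.

```python
from itertools import islice


def sketch_prefer_uncertain(scored, batch_size=4):
    """Yield examples, most uncertain first within each batch.

    Hypothetical illustration only: "uncertain" here means the score is
    closest to 0.5. This is not Prodigy's own implementation.
    """
    scored = iter(scored)
    while True:
        # Pull the next batch_size (score, example) tuples off the stream.
        batch = list(islice(scored, batch_size))
        if not batch:
            break
        # Sort by distance from 0.5 so the least certain examples come first.
        batch.sort(key=lambda item: abs(item[0] - 0.5))
        for score, example in batch:
            yield example


stream = [
    (0.95, {"text": "a"}),
    (0.48, {"text": "b"}),
    (0.10, {"text": "c"}),
    (0.60, {"text": "d"}),
]
ordered = [eg["text"] for eg in sketch_prefer_uncertain(stream)]
print(ordered)
```

The examples scored 0.48 and 0.60 come out first because the model is least sure about them, which is the intuition behind asking the annotator about those before the confident predictions.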
You can write your own active learning function. It should take a sequence of (score, example) tuples, and yield example objects. You might want to batch the sequence to sort it, like so:
    import cytoolz

    def example_sorter(scored, batch_size=128):
        for batch in cytoolz.partition_all(batch_size, scored):
            batch = list(batch)
            # Prefer low scores within the batch. Sorting on the score alone
            # avoids comparing the example dicts when two scores are equal.
            batch.sort(key=lambda item: item[0])
            for score, example in batch:
                yield example
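A batched sorter of this kind only reorders examples locally, so the output stream stays lazy but is not globally sorted. The sketch below shows that effect end to end, using only the standard library (itertools.islice in place of cytoolz). The names toy_model and low_score_first and the length-based scoring rule are hypothetical stand-ins, not part of Prodigy.

```python
from itertools import islice


def toy_model(tasks):
    # Hypothetical stand-in for a real model: longer text gets a higher score.
    for example in tasks:
        yield min(len(example["text"]) / 10.0, 1.0), example


def low_score_first(scored, batch_size=2):
    # Batched sorter: reorders each batch so low scores come out first.
    scored = iter(scored)
    while True:
        batch = list(islice(scored, batch_size))
        if not batch:
            break
        batch.sort(key=lambda item: item[0])
        for score, example in batch:
            yield example


tasks = [{"text": "dddd"}, {"text": "a"}, {"text": "ccc"}, {"text": "bb"}]
# Compose model and sorter: the result is a plain stream of example dicts.
stream = low_score_first(toy_model(tasks))
result = [eg["text"] for eg in stream]
print(result)
```

With a batch size of 2, "dddd" is emitted before "bb" even though it has the higher score, because the two fall into different batches. A larger batch size gives a better ordering at the cost of consuming more of the input before anything is yielded.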
In the recipe function, you return the examples under the stream key of the dict the recipe returns. So you could do something like: