Prodigy Active Learning prefer_uncertain mechanism

Hi all,

As I dig deeper into active learning in Prodigy, I'm getting more and more confused about the prefer_uncertain function. The docs say that prefer_uncertain reranks the examples, but does the sort operate within a single batch, or across several batches?

I assume it doesn't sort the whole dataset. If that's the case, after the uncertain items are chosen for the user to annotate, what happens to the rest of those batches? Are they just thrown away? I'm curious because, as the model updates, those previously certain items may get different scores. Shouldn't they be considered again?

What's more, what is the rule in prefer_uncertain? Does it prefer scores close to 0, or scores close to 0.5 and -0.5? I ask because when I looked at the scores of the annotated items in order, I couldn't find any pattern in them.

By the way, I'm running Prodigy with a PyTorch model. Could you give me any ideas on how to verify that the active learning is really working?

Thanks a lot!


prefer_uncertain sorts a generator, so yes, to do that we drop examples. We assume the feed is infinite; finite feeds can be cycled anyway. So we don't hold examples aside; we just move past them.

The mechanics are a bit subtle. We track a moving average of the uncertainty score, and emit examples that are more than one standard deviation more uncertain than average. You can also set algorithm="probability" to change this behaviour: the probability algorithm draws a random variable and uses the uncertainty score as the probability of retaining the example.
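To make the two strategies concrete, here's a minimal sketch of the idea (not Prodigy's actual implementation; the initial mean, variance, and smoothing factor are made-up constants for illustration):

```python
import random

def filter_uncertain(stream, algorithm="ema"):
    """Sketch of an uncertainty filter over a (score, example) stream.

    Uncertainty is highest for scores near 0.5. With the "ema" algorithm,
    keep examples whose uncertainty is at least one standard deviation
    above the exponential moving average; with "probability", keep each
    example with probability equal to its uncertainty.
    """
    mean = 0.5   # running mean of the uncertainty (illustrative start value)
    var = 0.05   # running variance (illustrative start value)
    alpha = 0.1  # smoothing factor for the moving statistics
    for score, example in stream:
        # 1.0 when score == 0.5, falling to 0.0 at score 0 or 1
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if algorithm == "probability":
            if random.random() < uncertainty:
                yield example
        else:
            if uncertainty >= mean + var ** 0.5:
                yield example
            # update the moving statistics either way
            mean = (1 - alpha) * mean + alpha * uncertainty
            var = (1 - alpha) * var + alpha * (uncertainty - mean) ** 2
```

This also shows why dropped examples aren't revisited: the generator only moves forward, and the threshold adapts as the statistics update.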

prefer_uncertain uses distance from 0.5, and assumes the scores are in the range [0, 1]. There's also a prefer_high_scores function. If you want to use your own figure-of-merit instead of distance from 0.5, you can always output tuples with your new score and just use prefer_high_scores.
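For instance, the tuple rewriting could look like this small helper (the helper name and the example figure-of-merit are mine, not part of Prodigy's API):

```python
def rescore_stream(stream, figure_of_merit):
    """Replace each example's model score with a custom figure-of-merit.

    `stream` yields (score, example) tuples; `figure_of_merit` maps
    (score, example) to the value you want the sorter to maximise.
    The output can then be piped into prefer_high_scores.
    """
    for score, example in stream:
        yield figure_of_merit(score, example), example

# Example figure-of-merit: margin-style uncertainty, highest near 0.5
margin_uncertainty = lambda score, example: 1 - abs(score - 0.5) * 2
```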

I'd suggest looping over the same examples, and checking that the scores really change. You could also make a call to your model in your update() callback, to check that the model assigns a different score to the example after your update.


Thanks for your reply, but I still have a question. The stream seems to be a generator, which can only be consumed once within a process. So does an infinite feed mean loading the process again and again?

By default we don't loop infinitely over the stream, as we don't want to assume that's always safe. It's easy to add, though. For instance, you can put your stream logic in a little Python script, and then pipe that forward into Prodigy (all the recipes read from standard input).
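Such a script could be as simple as this sketch (the script name, file path, and recipe arguments in the comment are placeholders):

```python
# stream_forever.py -- cycle over a JSONL file indefinitely, writing one
# task per line to stdout. Pipe it into a recipe, e.g. (placeholder args):
#   python stream_forever.py data.jsonl | prodigy textcat.teach my_dataset en_core_web_sm
import json
import sys

def stream_forever(path):
    """Yield tasks from a JSONL file in an endless cycle, so the feed
    never runs dry and previously skipped examples come around again."""
    while True:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    yield json.loads(line)

if __name__ == "__main__" and len(sys.argv) > 1:
    for task in stream_forever(sys.argv[1]):
        sys.stdout.write(json.dumps(task) + "\n")
        sys.stdout.flush()
```

Because the file is reopened on each pass, examples that were dropped by the sorter earlier will be re-scored by the updated model the next time around.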