prefer_uncertain: how does it use the stream to pick examples to score?

honnibal · December 12, 2017, 12:57pm

You can choose between two algorithms for prefer_uncertain:

prefer_uncertain(stream, algorithm='probability'): This randomly drops examples with a probability proportional to the distance of their score from 0.5. So if the model gives an example a score of 0.5, it will be dropped with probability close to 0.0 (so you’ll probably see it for annotation). If the model gives an example a score of 0.99 or 0.01, it will probably be dropped. Probabilities are clipped so that even if all of the scores are 0.0 or 1.0, you’ll still see some examples.
prefer_uncertain(stream, algorithm='ema'): This is the default sorter. It tracks the exponential moving average of the uncertainties, and also tracks a moving variance. It then asks questions which are one standard deviation or more above the current average uncertainty.
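
To make the two strategies concrete, here is a rough sketch of each in plain Python. This is not Prodigy's actual implementation: the function names, the `min_keep` clipping value, the `decay` rate and the choice to always emit the first example are all assumptions made for illustration.

```python
import random


def uncertainty(score):
    """Map a score in [0, 1] to an uncertainty: 1.0 at 0.5, 0.0 at 0.0 or 1.0."""
    return 1.0 - 2.0 * abs(score - 0.5)


def probability_sketch(scored_stream, min_keep=0.05, rng=None):
    """Keep each example with a probability that shrinks as the score moves away from 0.5.

    min_keep (hypothetical) clips the keep probability so that some
    examples still get through even if every score is 0.0 or 1.0.
    """
    if rng is None:
        rng = random.Random()
    for score, example in scored_stream:
        keep_prob = max(min_keep, uncertainty(score))
        if rng.random() < keep_prob:
            yield example


def ema_sketch(scored_stream, decay=0.9):
    """Keep examples whose uncertainty is at least one standard deviation
    above an exponential moving average of the uncertainties seen so far.
    """
    mean = None
    var = 0.0
    for score, example in scored_stream:
        u = uncertainty(score)
        if mean is None:
            # Assumption: the first example seeds the average and is emitted.
            mean = u
            yield example
            continue
        if u >= mean + var ** 0.5:
            yield example
        # Update the moving average and moving variance after the decision.
        diff = u - mean
        mean += (1.0 - decay) * diff
        var = decay * (var + (1.0 - decay) * diff * diff)
```

Note that `ema_sketch` only compares each uncertainty to the recent average, so it behaves the same way whether the scores cluster around 0.5 or are all squeezed near one end of the range, whereas `probability_sketch` reacts directly to the absolute score.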

The main difference is that the EMA method is basically insensitive to the absolute values of the scores. This can be useful because during active learning, the model can end up in states where the scores are quite miscalibrated. On the other hand, if you know the target class is rare, you want the sorter to “believe” the scores much more. In this case the probability sorter is better.

You can also add a bias to the sorter, so that the uncertainty calculation prefers higher scores to lower scores. Instead of sorting by distance from 0.5, you can sort by, say, distance from 0.6 or 0.7.
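
As an illustration of the idea only, the sketch below ranks a finite batch by distance from a configurable target score. The function name and the `target` parameter are hypothetical and are not how Prodigy exposes the bias. It also materializes the whole batch, which a streaming sorter would not do.

```python
def rank_by_distance(scored_batch, target=0.5):
    """Return the examples of a finite batch, closest to `target` first."""
    ranked = sorted(scored_batch, key=lambda pair: abs(pair[0] - target))
    return [example for _score, example in ranked]


batch = [(0.5, "mid"), (0.7, "high"), (0.2, "low")]
print(rank_by_distance(batch))              # unbiased: distance from 0.5
print(rank_by_distance(batch, target=0.7))  # biased towards higher scores
```

With the target moved to 0.7, the example scored 0.7 is treated as the most uncertain one and comes first, ahead of the example scored 0.5.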

You may find it useful to implement your own sorting function. The signature required is very simple: it’s just a function that takes a sequence of (score, example) tuples, and yields the examples in some order, dropping examples as you require.
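
For example, a custom sorter following that signature might look like the sketch below. The function name and the `low`/`high` thresholds are made up for illustration, and the scored stream is a hard-coded list standing in for whatever your model produces.

```python
def keep_score_band(scored_stream, low=0.3, high=0.7):
    """Yield only the examples whose score falls inside [low, high]."""
    for score, example in scored_stream:
        if low <= score <= high:
            yield example


# Stand-in for a model's output: (score, example) tuples.
scored = [
    (0.95, {"text": "clearly positive"}),
    (0.55, {"text": "borderline"}),
    (0.10, {"text": "clearly negative"}),
    (0.35, {"text": "leaning negative"}),
]

for example in keep_score_band(scored):
    print(example["text"])
```

Because the function is a generator, it works on an endless stream as well as on a list: it never needs to see all the examples before deciding about the current one.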
