You can choose between two algorithms for `prefer_uncertain`:

- `prefer_uncertain(stream, algorithm='probability')`: This randomly drops examples with probability proportional to their distance from 0.5. So if an example is given a score of 0.5 by the model, it will be dropped with probability close to 0.0 (so you’ll probably see it for annotation). If an example is given a score of 0.99 or 0.01 by the model, it will probably be dropped. Probabilities are clipped, so even if all of the scores are 0.0 or 1.0, you’ll still see some examples.
- `prefer_uncertain(stream, algorithm='ema')`: This is the default sorter. It tracks the exponential moving average of the uncertainties, along with a moving variance. It then asks about examples whose uncertainty is one standard deviation or more above the current average.
The main difference is that the EMA method is largely insensitive to the absolute values of the scores. This can be useful because, during active learning, the model can end up in states where the scores are quite miscalibrated. On the other hand, if you know the target class is rare, you want the sorter to “believe” the scores much more, in which case the probability sorter is the better choice.
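For instance, here is a minimal sketch of choosing between the two algorithms inside a recipe. It assumes `score_stream` is an iterable of `(score, example)` tuples, such as the output of applying a text classification model to your stream; the `make_stream` helper and its `rare_class` flag are just illustrative names.

```python
from prodigy.components.sorters import prefer_uncertain

def make_stream(score_stream, rare_class=False):
    """Pick a sorter based on how much we trust the model's scores."""
    if rare_class:
        # Believe the absolute scores more: drop examples with
        # probability proportional to their distance from 0.5.
        return prefer_uncertain(score_stream, algorithm='probability')
    # Default: track an exponential moving average of the uncertainties and
    # prefer examples at least one standard deviation above the average.
    return prefer_uncertain(score_stream, algorithm='ema')
```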
You can also add a bias to the sorter, so that the uncertainty calculation prefers higher scores to lower scores. Instead of sorting by distance from 0.5, you can sort by, say, distance from 0.6 or 0.7.
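As a rough sketch of what that bias does (the `biased_uncertainty` function and its `bias` argument here are illustrative names, not an official `prefer_uncertain` option), the uncertainty measure simply peaks at a target above 0.5:

```python
def biased_uncertainty(score, bias=0.1):
    """Uncertainty that peaks at 0.5 + bias (e.g. 0.6) instead of 0.5."""
    target = 0.5 + bias
    # Normalise so the furthest possible score still maps to 0.0
    return 1.0 - abs(score - target) / max(target, 1.0 - target)
```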
You may find it useful to implement your own sorting function. The signature required is very simple: it’s just a function that takes a sequence of `(score, example)` tuples and yields the examples in some order, dropping examples as you require.
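For example, here’s a minimal custom sorter along those lines. The function name `filter_confident` and the threshold value are arbitrary choices for illustration:

```python
def filter_confident(scored_stream, threshold=0.2):
    """Yield examples the model is still uncertain about, drop the rest."""
    for score, example in scored_stream:
        # 1.0 when the score is exactly 0.5, 0.0 when it's 0.0 or 1.0
        uncertainty = 1.0 - abs(score - 0.5) * 2.0
        if uncertainty >= threshold:
            yield example
```

You could then pass your scored stream through a function like this in place of `prefer_uncertain` in your recipe.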