I am using Prodigy for text classification, and I have a very large and quite unbalanced dataset: there are few positive examples relative to the corpus size.

I first used the --seeds option to get the best text to annotate. It was very nice for a cold start and worked quite well. I easily got my first hundred positive examples and I used them to train my model.

I am now trying to use this trained model to pick the best examples to annotate from the whole corpus, and not only from text selected with the --seeds option.

However, the active learning part doesn’t work very well for me right now, because it suggests a lot of examples with a low probability of being positive. So it is hard to find new positive examples to annotate.

Maybe I am wrong but I got the impression that it was because prefer_uncertain and prefer_high_scores were not looking at enough examples from my stream.
I would like to know how many examples from the stream prefer_high_scores and prefer_uncertain look at before sorting them, and what the best way to increase this threshold is, so that my model can see more examples before they are sorted according to their probability of matching my label.

You can choose between two algorithms for prefer_uncertain:

prefer_uncertain(stream, algorithm='probability'): This randomly drops examples with probability proportional to distance from 0.5. So if an example is given a score 0.5 by the model, it will be dropped with probability close to 0.0 (so you’ll probably see it for annotation). If an example is given a score 0.99 or 0.01 by the model, it will probably be dropped. Probabilities are clipped so that even if all of the scores are 0.0 or 1.0, you’ll still see some examples.

prefer_uncertain(stream, algorithm='ema'): This is the default sorter. It tracks the exponential moving average of the uncertainties, and also tracks a moving variance. It then asks questions which are one standard deviation or more above the current average uncertainty.

The main difference is that the EMA method is basically insensitive to the absolute values of the scores. This can be useful because during active learning, the model can end up in states where the scores are quite miscalibrated. On the other hand, if you know the target class is rare, you want the sorter to “believe” the scores much more. In this case the probability sorter is better.
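To make the difference between the two strategies concrete, here is a rough sketch of both in plain Python. This is not Prodigy's actual implementation: the function names, the `min_keep`, `decay` and `n_std` parameters, and the exact update formulas are assumptions chosen for illustration. Both functions take (score, example) tuples and yield examples.

```python
import random


def uncertainty(score):
    """1.0 when the score is 0.5, falling to 0.0 at scores of 0.0 or 1.0."""
    return 1.0 - 2.0 * abs(score - 0.5)


def probability_sorter(scored_stream, min_keep=0.05, rng=None):
    """Keep each example with probability equal to its clipped uncertainty."""
    if rng is None:
        rng = random.Random()
    for score, example in scored_stream:
        # Clipping (hypothetical min_keep) means that even fully confident
        # scores are still shown occasionally.
        keep_prob = max(min_keep, uncertainty(score))
        if rng.random() < keep_prob:
            yield example


def ema_sorter(scored_stream, decay=0.9, n_std=1.0):
    """Keep examples whose uncertainty is n_std deviations above a moving average."""
    mean = None
    var = 0.0
    for score, example in scored_stream:
        u = uncertainty(score)
        if mean is None:
            # First example: nothing to compare against yet, so show it.
            mean = u
            yield example
            continue
        if u >= mean + n_std * var ** 0.5:
            yield example
        # Update the exponential moving average and moving variance.
        diff = u - mean
        mean += (1.0 - decay) * diff
        var = decay * (var + (1.0 - decay) * diff * diff)
```

The EMA version only compares each uncertainty with the recent average, so it behaves the same whether the scores sit around 0.5 or around 0.05. The probability version uses the score directly, which is why it suits a rare target class better.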

You can also add a bias to the sorter, so that the uncertainty calculation prefers higher scores to lower scores. Instead of sorting by distance from 0.5, you can sort by, say, distance from 0.6 or 0.7.
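As an illustration of the bias idea, the sketch below measures uncertainty as closeness to a chosen target score rather than to 0.5. The `biased_uncertainty` helper and its `target` parameter are hypothetical names for this example, not part of Prodigy.

```python
def biased_uncertainty(score, target=0.5):
    """Closeness of `score` to `target`, scaled to the range [0, 1]."""
    # Divide by the larger distance to either end so the result never
    # goes below 0.0, whatever target is chosen.
    widest = max(target, 1.0 - target)
    return 1.0 - abs(score - target) / widest


# With target=0.7, a score of 0.9 ranks above a score of 0.1,
# whereas with target=0.5 the two would be tied.
scored = [(0.1, "low"), (0.9, "high"), (0.7, "on target")]
ranked = sorted(scored, key=lambda pair: biased_uncertainty(pair[0], 0.7), reverse=True)
```

Moving the target above 0.5 means that, among equally "distant" scores, the higher one is preferred, which helps when positives are rare.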

You may find it useful to implement your own sorting function. The signature required is very simple: it’s just a function that takes a sequence of (score, example) tuples, and yields the examples in some order, dropping examples as you require.
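For example, a custom sorter for a rare positive class could buffer a fixed number of scored examples and only yield the highest-scoring few from each window. The sketch below assumes only the signature described above, (score, example) tuples in and examples out. The function name and the `window` and `keep` parameters are made up for illustration.

```python
import heapq
from itertools import islice


def prefer_top_of_window(scored_stream, window=500, keep=10):
    """Yield the `keep` highest-scoring examples out of every `window` examples."""
    scored_stream = iter(scored_stream)
    while True:
        # Pull the next window of (score, example) pairs from the stream.
        batch = list(islice(scored_stream, window))
        if not batch:
            break
        # Highest scores first; everything else in the window is dropped.
        best = heapq.nlargest(keep, batch, key=lambda pair: pair[0])
        for score, example in best:
            yield example
```

A larger `window` lets the model see more of the stream before anything is shown for annotation. The cost is that more examples must be scored up front.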

Thank you Matthew for the quick and detailed reply! Very informative.

I just tried the other algorithm, but the parameter does not seem to be exposed as an argument: TypeError: prefer_uncertain() got an unexpected keyword argument 'algorithm'

Anyway, it is very nice to be able to implement our own sorting function! I will play a bit to find a good sorter function for my use case.