As I go deeper into active learning in Prodigy, I'm more and more confused about the prefer_uncertain function. It is said to rerank the examples, but does the sort operate within one batch or across several batches?

I assume it doesn't sort the whole dataset. If that is the case, after choosing the uncertain items for the user to annotate, what happens to the rest of the items in these batches? Are they just thrown away? I'm curious because, as the model updates, these previously certain items may get different scores. Shouldn't they be considered again?

What's more, what's the rule in prefer_uncertain? Does it prefer scores close to 0, or close to 0.5 and -0.5? I ask because when I looked at the scores of the annotated items in order, I couldn't find any pattern in the score order.

By the way, I'm running Prodigy with a PyTorch model. Could you give me any ideas on how to verify that the active learning really works?

prefer_uncertain sorts a generator, so yes, to do that we drop examples. We assume the feed is infinite; finite feeds can be cycled anyway. So we don’t want to hold examples aside; we just move past them.

The mechanics are a bit subtle. What we do is track the moving average of the uncertainty score, and output examples which are more than one standard deviation uncertain. You might also consider setting algorithm="probability" to change how this works. The probability algorithm draws a random variable and uses the uncertainty score as the probability to retain the example.
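To make the probability idea concrete, here is a minimal sketch in plain Python. It is not Prodigy's implementation: the names `uncertainty` and `filter_by_probability` are hypothetical, and the mapping from score to uncertainty (1.0 at a score of 0.5, falling to 0.0 at 0 and 1) is an assumption chosen for illustration.

```python
import random
from typing import Any, Iterable, Iterator, Optional, Tuple


def uncertainty(score: float) -> float:
    """Assumed mapping: 1.0 when score == 0.5, 0.0 when score is 0 or 1."""
    return 1.0 - 2.0 * abs(score - 0.5)


def filter_by_probability(
    scored_stream: Iterable[Tuple[float, Any]],
    rng: Optional[random.Random] = None,
) -> Iterator[Any]:
    """Keep each example with probability equal to its uncertainty.

    Examples that are not kept are dropped for good; the generator
    simply moves past them.
    """
    rng = rng or random.Random()
    for score, example in scored_stream:
        # rng.random() is uniform in [0, 1), so the example is retained
        # with probability uncertainty(score).
        if rng.random() < uncertainty(score):
            yield example
```

With this mapping, an example scored exactly 0.5 is always kept and one scored exactly 0 or 1 is never kept; everything in between is a coin flip weighted by how close the score is to 0.5.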

prefer_uncertain uses distance from 0.5. It assumes the scores are in the range [0, 1]. There’s also a prefer_high_scores function. If you want to use your own figure-of-merit instead of distance from 0.5, you can always output tuples with your new score and just use prefer_high_scores.
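As a rough sketch of the "output tuples with your new score" idea: the generator below pairs each example with a custom figure of merit. The helper names, the `predict` callable and the threshold of 0.7 are all hypothetical, picked only to show the shape of the data.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple


def closeness_to_threshold(probability: float, threshold: float = 0.7) -> float:
    """Hypothetical figure of merit: highest when the probability is near the threshold."""
    return 1.0 - abs(probability - threshold)


def with_custom_merit(
    examples: Iterable[Dict[str, Any]],
    predict: Callable[[Dict[str, Any]], float],
    merit: Callable[[float], float] = closeness_to_threshold,
) -> Iterator[Tuple[float, Dict[str, Any]]]:
    """Yield (merit, example) tuples, where `predict` is your own model call."""
    for eg in examples:
        yield merit(predict(eg)), eg


# The resulting (score, example) tuples are what you would then hand to
# prefer_high_scores instead of prefer_uncertain.
```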

I guess you could loop over the same examples and check that the scores are really changing. You could also call your model in your update() callback to check that it assigns a different score to the example after the update.
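Here is a small sketch of that check. `TinyModel` is a stand-in written in plain Python so the example runs anywhere; in practice you would call your own PyTorch model's predict and training step instead. The `"x"` field and the function name `checked_update` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


class TinyModel:
    """Stand-in for a real model: one-feature logistic regression."""

    def __init__(self) -> None:
        self.w = 0.0
        self.b = 0.0

    def predict(self, x: float) -> float:
        return 1.0 / (1.0 + math.exp(-(self.w * x + self.b)))

    def update(self, x: float, label: float, lr: float = 0.5) -> None:
        # One step of gradient descent on the log loss.
        err = self.predict(x) - label
        self.w -= lr * err * x
        self.b -= lr * err


def checked_update(model: TinyModel, answers: List[Dict]) -> List[Tuple[float, float]]:
    """Update on each annotated answer and record (score_before, score_after)."""
    changes = []
    for eg in answers:
        label = 1.0 if eg["answer"] == "accept" else 0.0
        before = model.predict(eg["x"])
        model.update(eg["x"], label)
        after = model.predict(eg["x"])
        changes.append((before, after))
    return changes
```

If the "after" score never moves toward the label you just gave, the update is probably not reaching the model that produces the scores.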

Thanks for your reply. But I still have a question. It seems that the stream is a generator that can only be consumed once per process. So does an infinite feed mean loading the process again and again?

By default we don’t loop infinitely over the stream, as we don’t want to assume that. It’s easy to add that though. For instance, you can put your stream logic in a little Python script, and then pipe that forward into Prodigy (all the recipes read from standard input).
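A minimal sketch of such a script, assuming your data lives in a JSONL file. The function names are hypothetical, and the exact command used to pipe the output into a recipe is not shown here.

```python
import json
import sys
from typing import IO, Callable, Dict, Iterable, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def cycle_stream(make_stream: Callable[[], Iterable[Dict]]) -> Iterator[Dict]:
    """Loop forever by re-creating the stream each time it runs out.

    Taking a factory rather than a generator avoids caching every example
    in memory. An empty stream ends the loop instead of spinning.
    """
    while True:
        seen_any = False
        for eg in make_stream():
            seen_any = True
            yield eg
        if not seen_any:
            return


def write_jsonl(stream: Iterable[Dict], out: IO[str]) -> None:
    """Write each example as one JSON line, flushing so the reader sees it promptly."""
    for eg in stream:
        out.write(json.dumps(eg) + "\n")
        out.flush()
```

At the bottom of the script you would call something like `write_jsonl(cycle_stream(lambda: read_jsonl("data.jsonl")), sys.stdout)`, where `data.jsonl` is a placeholder path, and pipe the script's output forward.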

output examples which are more than one standard deviation uncertain

What is the standard dev calculated on? Std dev of all uncertainty scores? All the probabilities the model outputs (in case of binary text classification)? The standard dev of true class assignments?

You might also consider setting algorithm="probability" to change how this works. The probability algorithm draws a random variable and uses the uncertainty score as the probability to retain the example.

What is the random variable here? Is it a random sample from the input stream?

I'm not sure how this differs from prefer_uncertain. If I understand the pipeline correctly, for probability it goes something like this: take a sample --> score it (predict class probability) --> calculate uncertainty --> decide keep/drop?

I have been trying to understand how active learning works under a teach recipe, specifically for the text classification case: textcat.teach. A couple of questions about it:

For a highly imbalanced dataset (major class being 0 in a binary classification task), is it better to use prefer_high_scores instead of prefer_uncertain to construct a more balanced dataset?

The sorting mechanism is designed to account for large input data sets, so it works in a streaming fashion, sorting chunks of data. It's a tricky balancing act because we don't want to block the annotation feed while we search for the best examples.

The standard deviation we refer to is a running estimate of the standard deviation. This is an approximation; the calculation is similar to an exponential moving average estimate. It's the standard deviation of the specified "figure of merit". In prefer_high_scores it will be the scores, in prefer_uncertain it'll be the uncertainty.

The algorithm='probability' does not try to adjust for the scale of the scores, instead trusting them directly.

Let's say you're using prefer_high_scores and the scores are mostly coming in around 0.9. Under algorithm='probability', each of these examples will have a high chance of being emitted. But under the moving average algorithm, most of these examples will be filtered out, and the sorter will look for examples that score higher than average.
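The moving average behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not Prodigy's code: the class and function names, the smoothing factor `alpha`, the exponentially weighted variance update, and the "more than one standard deviation above the running mean" cutoff are all choices made for the sketch.

```python
import math
from typing import Any, Iterable, Iterator, Optional, Tuple


class RunningStats:
    """Exponentially weighted running estimate of mean and variance."""

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha  # assumed smoothing factor
        self.mean: Optional[float] = None
        self.var = 0.0

    @property
    def std(self) -> float:
        return math.sqrt(self.var)

    def update(self, x: float) -> None:
        if self.mean is None:
            self.mean = x
            return
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * diff * diff)


def filter_above_average(
    scored_stream: Iterable[Tuple[float, Any]], alpha: float = 0.05
) -> Iterator[Any]:
    """Emit examples whose figure of merit beats the running mean by one std."""
    stats = RunningStats(alpha)
    for merit, example in scored_stream:
        # Compare against the statistics seen so far, then fold this value in.
        if stats.mean is not None and merit > stats.mean + stats.std:
            yield example
        stats.update(merit)
```

In this sketch, a stream where every score is 0.9 emits nothing, because no example stands out from the running average; a later example scoring 0.99 does get through.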

So if your model is already producing reasonably well calibrated estimates, you might want to use algorithm='probability'. But if you don't trust the scores directly, or you worry that as you click your model might get 'stuck' with bad weights, the moving average sorter can be better.

Under imbalanced classes, you can generally expect that the scores for the positive class will stay low, so uncertainty and probability will be similar: when scores stay below 0.5, the highest-scoring examples are also the ones closest to 0.5.

The sorting mechanism is designed to account for large input data sets, so it works in a streaming fashion, sorting chunks of data. It's a tricky balancing act because we don't want to block the annotation feed while we search for the best examples.

I understand the need to chunk the data so that scoring is fast and the annotator doesn't wait for samples to show up in the UI. However, it would be great if the chunk size were a parameter that the user could set. For example, my use case is a very highly imbalanced dataset (1:1000+), so with the default streaming mechanism I spend most of my time labeling only the majority class.

If a parameter were available to set the chunk size, I would set it to 2k, since the model's scoring is pretty fast and I don't mind waiting 5 seconds for the next batch of examples to show up in the UI.
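For illustration, here is roughly what I have in mind, written as a standalone generator rather than anything built into Prodigy. The function name and both parameters are hypothetical: it buffers `chunk_size` scored examples, sorts them, and passes on only the top `keep`.

```python
import itertools
from typing import Any, Iterable, Iterator, Tuple


def top_k_per_chunk(
    scored_stream: Iterable[Tuple[float, Any]],
    chunk_size: int = 2000,
    keep: int = 10,
) -> Iterator[Any]:
    """Sort the stream chunk by chunk and yield the highest-scoring examples."""
    it = iter(scored_stream)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        # Highest score first; the rest of the chunk is dropped.
        chunk.sort(key=lambda pair: pair[0], reverse=True)
        for _score, example in chunk[:keep]:
            yield example
```

The trade-off is that the annotator waits while a whole chunk is scored, which is acceptable for me with a fast model.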

By default we don’t loop infinitely over the stream, as we don’t want to assume that. It’s easy to add that though. For instance, you can put your stream logic in a little Python script, and then pipe that forward into Prodigy (all the recipes read from standard input).

I understand that a chunk size parameter might never be made available to users, so I was intrigued by this suggestion you made earlier. Is there an example of this "stream logic Python script" somewhere that I can start from?

By the way, the workaround I'm currently using is to score and sort the whole unlabeled dataset with my own model and save it as .jsonl. Then I use the default streaming function in textcat.manual or textcat.teach.
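In case it helps anyone else, the workaround looks roughly like this. The function names are my own, `score_fn` stands for whatever call returns your model's score for an example, and storing the score under "meta" is just a choice made for this sketch.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def score_and_sort(
    examples: Iterable[Dict[str, Any]],
    score_fn: Callable[[Dict[str, Any]], float],
) -> List[Dict[str, Any]]:
    """Score every example, attach the score, and sort highest first."""
    scored = []
    for eg in examples:
        meta = dict(eg.get("meta", {}), score=score_fn(eg))
        scored.append(dict(eg, meta=meta))  # copy, so the input is not modified
    scored.sort(key=lambda eg: eg["meta"]["score"], reverse=True)
    return scored


def save_jsonl(examples: Iterable[Dict[str, Any]], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")
```

The obvious drawback is that the order is frozen at the time of scoring, so it does not react to the model updating during annotation.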