Taking advantage of TONS of unlabeled data

With the current active learning setting, predictions are made in batches, and the active learning algorithm then decides which of those predictions will help the model learn most quickly (and keep the annotation process fast).

The problem is that after a relatively short time, the active learning algorithm is unlikely to find helpful candidate annotations in each batch - it just finds examples that the model is already quite confident about.

Do you have any suggestions for getting the active learning system to look at a bigger sample of data or skip entire batches before presenting text to the annotator?

You might want to implement your own sorter function for this. The function just needs to take a sequence of (score, example) tuples and yield a sequence of examples. Another simple thing you could try is prefer_uncertain(stream, algorithm='probability'). The probability algorithm randomly drops examples in proportion to their score. In contrast, the default sorter, which uses an exponential moving average, doesn’t trust that the scores produced by the model are going to stay well calibrated. It asks you about examples that score higher than average, and tracks that average over time. So if you get a long sequence of low-scoring examples, the probability sorter will ask you fewer questions from them, while the exponential moving average sorter will get impatient and start asking you questions even if the scores are low.
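To make the difference concrete, here is a rough sketch of the three ideas in plain Python. None of this is Prodigy's actual implementation: the function names, the `min_uncertainty` and `decay` parameters, and the `uncertainty` helper are all made up for illustration. The only thing taken from the description above is the contract: take (score, example) tuples, yield examples.

```python
import random
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple

Example = Dict[str, Any]
Scored = Iterable[Tuple[float, Example]]


def uncertainty(score: float) -> float:
    """Map a probability in [0, 1] to an uncertainty: 0.5 -> 1.0, 0.0 or 1.0 -> 0.0."""
    return 1.0 - abs(score - 0.5) * 2.0


def skip_confident(scored: Scored, min_uncertainty: float = 0.3) -> Iterator[Example]:
    """Hypothetical custom sorter: silently skip anything the model is sure about."""
    for score, example in scored:
        if uncertainty(score) >= min_uncertainty:
            yield example


def probability_sorter(scored: Scored, rng: Optional[random.Random] = None) -> Iterator[Example]:
    """Simplified 'probability' idea: keep each example with probability equal to its uncertainty."""
    rng = rng or random.Random()
    for score, example in scored:
        if rng.random() < uncertainty(score):
            yield example


def ema_sorter(scored: Scored, decay: float = 0.9) -> Iterator[Example]:
    """Simplified moving-average idea: keep examples at least as uncertain as the running average."""
    average: Optional[float] = None
    for score, example in scored:
        current = uncertainty(score)
        if average is None or current >= average:
            yield example
        # Update after the decision, so the bar drifts down during a long confident run.
        average = current if average is None else decay * average + (1 - decay) * current
```

Assuming your model yields (score, example) tuples, something like `stream = skip_confident(model(stream))` would keep pulling from the stream until it finds an uncertain example, which in effect skips whole batches of confident predictions. A fixed threshold can leave the annotator waiting if the model becomes confident about everything, which is the trade-off the moving-average approach is meant to avoid.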

Hi,
When using prefer_low_scores, the sorter will not stream the examples with high scores, right? So will those removed examples, or the predictions for that confident data, be saved somewhere or not?
If not, and the confident data is discarded, don’t you think the training data distribution might become biased and skewed away from the bigger classes? Is that a concern for the models?

No, filters like prefer_uncertain or prefer_high_scores decide which examples in your incoming data to send out. Everything that's not sent out doesn't get annotated and won't be saved in the database.

This is actually kind of the purpose of using a sorter like prefer_high_scores. In cases where you have a very imbalanced distribution, you might want to explicitly bias the example selection to achieve better results.
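If you do want a record of what was filtered out, one option is to write your own sorter that sets the skipped examples aside. The sketch below is only an illustration of that idea, not a built-in feature: the function name, the `skipped` list, and the fixed `threshold` are all hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Example = Dict[str, Any]


def high_scores_keeping_rest(
    scored: Iterable[Tuple[float, Example]],
    skipped: List[Example],
    threshold: float = 0.5,
) -> Iterator[Example]:
    """Hypothetical sorter: send out high scorers, remember the rest with their score."""
    for score, example in scored:
        if score >= threshold:
            yield example
        else:
            # Store a copy with the score attached; the original example is left untouched.
            skipped.append({**example, "score": score})
```

After the session, the `skipped` list could be written to a file and inspected, for example to check how far the annotated set has drifted from the overall class distribution.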

Great. Thanks a lot!!