Taking advantage of TONS of unlabeled data

You might want to implement your own sorter for this. The function just needs to take a sequence of (score, example) tuples and yield a sequence of examples. Another simple thing to try is prefer_uncertain(stream, algorithm='probability'). The probability algorithm randomly drops examples, with probability proportional to their score. In contrast, the default sorter doesn't trust that the scores produced by the model will stay well calibrated. It asks you about examples that score higher than average, and tracks that average over time with an exponential moving average. So if you get a long sequence of low-scoring examples, the probability sorter will ask you about few of them, while the exponential moving average sorter will get impatient and start asking you questions even if the scores are low.
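To make the sorter contract concrete, here is a minimal sketch of both strategies. These are toy reimplementations of the behavior described above, not Prodigy's actual internals; the function names `probability_sorter` and `ema_sorter` and the `smoothing` parameter are invented for illustration.

```python
import random

def probability_sorter(scored_stream):
    """Toy probability sorter: keep each example with probability
    equal to its score, so low-scoring runs yield few questions."""
    for score, example in scored_stream:
        if random.random() < score:
            yield example

def ema_sorter(scored_stream, smoothing=0.5):
    """Toy exponential-moving-average sorter: yield examples that
    score at or above the running average. A long run of low scores
    drags the average down, so the sorter 'gets impatient' and
    starts asking about low-scoring examples too."""
    avg = None
    for score, example in scored_stream:
        if avg is None:
            avg = score            # initialize from the first score
        if score >= avg:
            yield example
        avg = avg + smoothing * (score - avg)  # update the moving average
```

For example, feeding `ema_sorter` a high score followed by a run of low ones shows the average decaying until a merely mediocre score gets through:

```python
stream = [(0.9, "a"), (0.2, "b"), (0.3, "c"), (0.25, "d"), (0.35, "e")]
print(list(ema_sorter(stream)))  # → ['a', 'e']
```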