Active Learning: Does it work?

Interesting post from what appears to be a competitor of yours on why they don’t support active learning:

I think they have a valid point… active learning is only as good as the model it’s using. My biggest concern is that this issue will be more acute for corpora with a sufficiently distinctive vocabulary.

Have you done any analysis to show that the active learning approach is in fact superior to the alternatives? Have you thought about supporting other active learning approaches that might not suffer from these kinds of issues (such as Query by Committee mentioned in the blog post)?

Changing the active learning is easy: it’s just a sort function that the feed gets filtered through. You can write any function that takes an iterable of (score, example) pairs and produces an iterable of examples, and put that after the model. Prodigy will take care of keeping the model updated in the loop, so the sorter itself is all you have to write; a minimal example is sketched below.
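For instance, here’s a minimal sketch of a custom sorter, assuming the (score, example) interface described above. The function name and threshold values are made up for illustration:

```python
from typing import Iterable, Iterator, Tuple

def skip_confident(scored_stream: Iterable[Tuple[float, dict]],
                   low: float = 0.2, high: float = 0.8) -> Iterator[dict]:
    """Illustrative sorter: only ask about examples the model
    isn't already confident on."""
    for score, example in scored_stream:
        if low <= score <= high:
            yield example
```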

I think it’s true that a purely online service will probably struggle to make active learning useful in their workflow. If the tool lives entirely online, it’s difficult to switch between different modes, to start and stop the server to run a batch-train process, or to interact with the tool programmatically, which I think is super useful.

I think for text classification with roughly balanced classes, it’s pretty uncertain whether active learning will help. But if you’ve got a number of rarer classes, the situation is quite different: if the class you care about is only 1% of your data, you’re very motivated to do some sort of example selection. Uncertainty sampling is a very convenient way to do that, without any problem-specific logic (see the sketch below).
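As a toy illustration of the principle (not Prodigy’s actual implementation, and it consumes the whole stream up front, where a real sorter would work in batches):

```python
def sort_by_uncertainty(scored_stream, center=0.5):
    """Illustrative uncertainty sampling: ask first about the examples
    whose scores are closest to the decision boundary."""
    ranked = sorted(scored_stream, key=lambda pair: abs(pair[0] - center))
    for score, example in ranked:
        yield example
```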

We’ve seen the biggest benefits from active learning in the ner.teach recipe. Running ner.teach with one of the rarer entity labels is an extremely fast way to improve the accuracy of that label: several times faster than the equivalent random sampling and manual annotation. The benefit comes from two places: the model lets us make the interface binary, and it drives the example selection. Of course, we can’t have the binary interface without putting the model in the loop; and if we base the annotation questions on the model’s output, then we need to keep updating the model as we go; otherwise we’d never be able to teach it new entities it stubbornly misses.

I think the literature on active learning really misses these user-interface-driven questions around the technology. It’s fairly useless to evaluate active learning by taking a subset of a training corpus and running a simulation. The point is more like giving the user an IDE: how do we give them a smarter workflow that makes the task easier and less error-prone? Putting the model in the loop opens up huge possibilities.

Consider this problem of having the dataset tied to the model. The concern is, “Okay, we skipped 40% of the examples because the model was confident on them. But now if we retrain, we might be missing important information!” Fair enough. But consider: if you have a model that gets 99% accuracy on some subset of the data, how quickly do you think you can label those examples? Just stream in all the confident “Yes” predictions and say yes to them all. You’ll click through them in less than a second each, and just backspace when one that looks wrong flashes by. Doing a lot of the same thing in a row is amazingly quick. It’s also much more accurate. If you sprinkled the same confident examples randomly through a dataset, your error rate on them would likely be higher than it is with this speed-review process. The fewer mental context switches you have to make, and the less you have to refocus, the more accurate your decisions will be. A sorter for this kind of speed review could be as simple as the sketch below.
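Again just a sketch, with a made-up name and threshold, and batching glossed over:

```python
def confident_yes_first(scored_stream, threshold=0.9):
    """Illustrative speed-review ordering: stream the model's confident
    'yes' predictions in one long run, highest score first, so you can
    accept them in quick succession."""
    ranked = sorted(scored_stream, key=lambda pair: pair[0], reverse=True)
    for score, example in ranked:
        if score < threshold:
            break  # below this point the model isn't confident any more
        yield example
```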

Finally, I must say that I found the example given very contrived. It’s also really weird to give accuracy statistics for a thought experiment :p. We’re not using an RBF kernel, and the characteristics of NLP problems are vastly different from that example. In NLP the dimensionality of the feature set is enormous, and the data is dominated by common words. The task of an NLP model is very much learning to set a simple policy on the common cases, while fine-tuning on the tail end. I think active learning builds in the right type of bias for this process. You do have to be smart about the details. It’s important to always have a chance of asking questions, even ones the model is confident about; otherwise we can’t keep the model’s estimates well calibrated. It’s also important to have a model that learns quickly. The textcat model uses an ensemble of a unigram bag-of-words model and a CNN, which helps quite a lot in that respect.
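On the calibration point, the fix can be as simple as mixing a little exploration into whatever selection strategy you use. A sketch, with the epsilon value picked arbitrarily:

```python
import random

def with_exploration(scored_stream, low=0.3, high=0.7, epsilon=0.05):
    """Illustrative exploration mix: mostly ask about uncertain examples,
    but keep a small chance of asking about confident ones too, so the
    model's calibration can still be checked."""
    for score, example in scored_stream:
        if low <= score <= high or random.random() < epsilon:
            yield example
```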

So in summary:

  • A good model makes annotation very fast. So if the output we want is a data set, it’s still a good idea to ask “How can I quickly get a good model?”.

  • Active learning lets you structure a feed of questions in a way that makes them quick to answer, e.g. asking a lot of questions of the same type together. This can be very helpful.

  • Many annotation problems have a needle-in-a-haystack scenario. Example selection is super important for this.

  • You don’t have to use active learning with Prodigy, and it won’t always be helpful. For instance, we strongly recommend you collect an evaluation set that’s unbiased — even if you use active learning as an intermediate step for that.


Thanks for the quick response @honnibal! I’ve been giving this a lot of thought. What I was considering was annotating a few hundred or thousand examples using the built-in active learning approach, then building a model on these annotations and making predictions on a holdout set. Then I’d sample from this holdout to get a good mix of predictions at various probabilities (an equal number in the 0-10% bucket, the 10-20% bucket, and so on; something like the sketch below). Finally, I’d annotate that sample to see whether the probabilities returned by the model are properly calibrated, or whether there’s some subset of the dataset that’s not performing well. Maybe I would also apply a subsampling approach to the holdout set to ensure a good variety of data points?
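Concretely, the bucketing step I have in mind would look something like this (names and bucket sizes made up):

```python
import random
from collections import defaultdict

def sample_per_bucket(scored_holdout, per_bucket=50, n_buckets=10, seed=0):
    """Group holdout predictions into probability buckets (0-10%, 10-20%, ...)
    and draw the same number from each, so every score range gets annotated."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for score, example in scored_holdout:
        idx = min(int(score * n_buckets), n_buckets - 1)
        buckets[idx].append(example)
    sample = []
    for members in buckets.values():
        rng.shuffle(members)
        sample.extend(members[:per_bucket])
    return sample
```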

I kind of feel like I’m making it up as I go along. Is there a more principled approach I could be following?

I have to say that we’re very interested in keeping our recommendations evolving too! I think it’ll take another few years for the best practices to stabilise. Remember that machine learning was a much more niche topic ten years ago. The models we’re using are also very new, as is the ecosystem of supporting software.

Andrew Ng’s writing a little primer, “Machine Learning Yearning”, that goes through a lot of practical advice. It might be a bit basic though.

My best advice would be to train an initial model, likely using textcat.teach with patterns. After an hour or two, stop and check the accuracy (using the textcat.train-curve command); if accuracy is still improving quickly, continue annotating for another hour or so. During this period you won’t have an evaluation sample yet, so you’ll be evaluating against a subset of the annotations. Those annotations are biased by the model’s selection process, so the evaluation isn’t really fair; it’s just a quick estimate to show you a before and after. So, the next step is to make a proper held-out set. Ideally you want the inputs to be in a separate file, so that you know you’re not going to have the same examples in the train and test partitions (a simple split like the sketch below works).
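For the file split, something like this does the job, assuming JSONL inputs; paths and sizes are placeholders:

```python
import json
import random

def split_inputs(in_path, train_path, eval_path, eval_size=500, seed=0):
    """Shuffle the raw inputs once and write the evaluation texts to their
    own file, so train and test examples can never overlap."""
    with open(in_path, encoding="utf8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    random.Random(seed).shuffle(records)
    with open(eval_path, "w", encoding="utf8") as f:
        for record in records[:eval_size]:
            f.write(json.dumps(record) + "\n")
    with open(train_path, "w", encoding="utf8") as f:
        for record in records[eval_size:]:
            f.write(json.dumps(record) + "\n")
```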

It’s okay to annotate the held-out set semi-automatically, e.g. by sorting it in some way. One idea is to sort by score, so you can click through all the confident examples quickly, as I suggested before. I can think of other orderings that might be efficient. Sometimes you want to see similar examples together. Maybe it would be good to assume the model’s predictions are true, and calculate an information-gain score for each word: how often does this word occur in category A versus category B? You could then sort the words by their information gain, and annotate all the examples containing that word together. This will give you patches where you’re fixing one type of error at a time, which should be quick. Again, this is just a suggestion (I haven’t tried it); a rough sketch of the word-scoring part is below.
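To make the word-scoring idea concrete, here’s a rough, untested sketch using smoothed log-odds as a stand-in for information gain. It assumes two categories and (label, text) pairs as input:

```python
import math
from collections import Counter

def score_words(predictions):
    """Treat the model's predicted labels as true and score each word by
    how strongly it separates the two categories (smoothed log-odds)."""
    counts = {"A": Counter(), "B": Counter()}
    for label, text in predictions:  # label is "A" or "B"
        counts[label].update(text.lower().split())
    scores = {}
    for word in set(counts["A"]) | set(counts["B"]):
        freq_a = counts["A"][word] + 1  # add-one smoothing
        freq_b = counts["B"][word] + 1
        scores[word] = abs(math.log(freq_a / freq_b))
    return scores
```

You could then pick the highest-scoring words and queue up all the examples containing each word as one batch.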

Once you have a stable held-out set, you can develop a workflow for getting training data that improves against it. Occasionally you might want to stop and rerun the process you used to create the held-out sample, to give yourself an unbiased training sample as well. If the process is fast, it’s probably worth it, because it makes the training data easier to reason about.


Thanks! It seems like a hybrid approach for generating the holdout might be best just in case one technique introduces some bias… hopefully the others will not have the same kind of bias.