Active Learning: Does it work?

Changing the active learning behaviour is very easy: it’s just a sorting function that the feed gets filtered through. You can write any function that takes an iterable of (score, example) pairs and produces an iterable of examples, and put that after the model. Prodigy will take care of keeping the model updated in the loop, so you only have to write the function itself.
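As a concrete sketch of that contract (the function name and the 0.2 band are placeholders, not Prodigy built-ins), a custom sorter could look like this:

```python
def keep_uncertain(scored_stream, band=0.2):
    """Take an iterable of (score, example) pairs, yield examples.

    This version keeps the examples the model is least sure about,
    i.e. scores close to 0.5. The only contract is pairs in,
    examples out, so you can swap in any selection logic you like.
    """
    for score, example in scored_stream:
        if abs(score - 0.5) < band:
            yield example
```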

I think it’s true that a purely online service will probably struggle to make active learning useful in its workflow. If the tool is entirely online, it’s difficult to switch between different modes, and it’s difficult to start and stop the server to run a batch-train process. It’s also difficult to interact with the tool programmatically — which I think is super useful.

I think for text classification with roughly balanced classes, it’s pretty uncertain whether active learning will help. But if you’ve got a number of rarer classes, the situation is quite different: if the class you care about is 1% of your data, you’re very motivated to do some sort of example selection. Uncertainty sampling is a very convenient way to do that, without any problem-specific logic.
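For instance, a rough sketch of score-based selection for a rare class might look like the following; the name and the cutoff are arbitrary, the point is just that a cheap filter on the model’s score concentrates the stream on likely hits:

```python
def prefer_likely_positives(scored_stream, min_score=0.3):
    """Surface candidates for a rare class.

    If the class of interest is ~1% of the data, random sampling
    shows you roughly one relevant example per hundred questions.
    Filtering on the model's score means most of the questions you
    answer actually tell you something about that class.
    """
    for score, example in scored_stream:
        if score >= min_score:
            yield example
```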

We’ve seen the biggest benefits from active learning in the ner.teach recipe. Running ner.teach with one of the rarer entity labels is an extremely fast way to improve the accuracy of that label — several times faster than the equivalent random sampling and manual annotation. The benefit comes from two places: the model lets us make the interface binary, and it lets us select which examples to ask about. Of course, we can’t have the binary interface without putting the model in the loop; and if we base the annotation questions around the model’s output, then we need to be updating the model as we go — otherwise we’d never be able to teach the model new entities it stubbornly misses.
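To give a sense of what that looks like in practice, an ner.teach session for one of the rarer labels can be started like this (the dataset name and source file here are placeholders):

```bash
prodigy ner.teach rare_label_ner en_core_web_sm ./news_headlines.jsonl --label WORK_OF_ART
```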

I think the literature on active learning really misses these user-interface-driven questions around the technology. It’s fairly useless to evaluate active learning by taking a subset of a training corpus and running a simulation. The point is more like giving the user an IDE: how do we give them a smarter workflow that makes the task easier and less error-prone? Putting the model in the loop opens up huge possibilities.

Consider this problem of having the dataset tied to the model. The concern is, “Okay, we skipped 40% of the examples because that model was confident on them. But now if we retrain, we might be missing important information!”. Fair enough. But consider: if you have a model that gets 99% accuracy on some subset of the data, how quickly do you think you can label those examples? Just stream in all the confident “Yes” predictions and say yes to them all. You’ll click through them in less than a second each, and just hit backspace when one that looks wrong flashes by. Doing a lot of the same thing in a row is amazingly quick. It’s also much more accurate: if you sprinkled the same confident examples randomly through a dataset, your error rate on them would likely be higher than it is with this speed-review process. The less time you have to spend refocusing, and the fewer mental context switches you have to make, the more accurate your decisions will be.
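As a sketch of how such a feed could be arranged (the names and the 0.9 threshold are mine, and this assumes a finite batch rather than an endless stream):

```python
def confident_yes_first(scored_stream, threshold=0.9):
    """Group the model's confident predictions into one long run.

    Accepting a run of near-identical decisions is extremely fast:
    you say yes by default and only stop when something that looks
    wrong flashes by. Everything else is held back for normal
    annotation afterwards.
    """
    held_back = []
    for score, example in scored_stream:
        if score >= threshold:
            yield example
        else:
            held_back.append(example)
    yield from held_back
```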

Finally, I must say that I found the example given very contrived. It’s also really weird to give accuracy statistics for a thought experiment :p. We’re not using an RBF kernel, and the characteristics of NLP problems are vastly different from this. In NLP the dimensionality of the feature set is enormous, and the data is dominated by common words. The task of an NLP model is very much learning to set a simple policy on the common cases, while fine-tuning on the tail end. I think active learning builds in the right type of bias for this process. You do have to be smart about the details. It’s important to always have a chance of asking questions, even if the model is confident on them — otherwise we can’t keep the model’s estimates well calibrated. It’s also important to have a model that learns quickly. The textcat model uses an ensemble of a unigram bag-of-words and a CNN, which helps quite a lot in that respect.
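One simple way to keep that chance of asking confident questions is a small exploration rate on top of the uncertainty filter; a minimal sketch (the 5% rate and the names are arbitrary):

```python
import random

def uncertain_with_exploration(scored_stream, band=0.2, explore=0.05):
    """Mostly ask uncertain questions, but occasionally ask a
    confident one, so the model's calibration still gets checked."""
    for score, example in scored_stream:
        if abs(score - 0.5) < band or random.random() < explore:
            yield example
```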

So in summary:

  • A good model makes annotation very fast. So if the output we want is a data set, it’s still a good idea to ask “How can I quickly get a good model?”.

  • Active learning lets you structure a feed of questions in a way that makes them quick to answer, e.g. asking a lot of questions of the same type together. This can be very helpful.

  • Many annotation problems are needle-in-a-haystack scenarios. Example selection is super important for these.

  • You don’t have to use active learning with Prodigy, and it won’t always be helpful. For instance, we strongly recommend you collect an unbiased evaluation set — even if you use active learning as an intermediate step along the way.
