Active learning performs worse than pretrained model


I am experimenting with active learning on a toy dataset and I am seeing some non-intuitive behaviour that I am finding hard to debug. In particular, fine-tuning a spaCy base model with annotations gathered via active learning seems to degrade performance instead of improving it.

I am using the ag_news dataset and annotating ORG entities, starting from en_core_web_md. I wanted to compare the annotation speed of manual labelling vs active learning. I annotated around 1K examples with each method, and I am seeing consistent gains from the manual annotations but a consistent drop from the active learning ones.

To make matters more interesting, I am actually seeing better performance on the dev set of the annotations but worse performance on a held-out test set. I was under the impression that the random draw used for manual annotation was doing a better job of producing a representative sample than the active learning approach, but when I embedded and plotted the data, both samples seem to be nicely distributed across the space. There is no visible bias in the active learning sample that could justify this.
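For reference, the kind of bias check described above can be sketched roughly as below. The random vectors here are stand-ins for real document embeddings (e.g. `doc.vector` from en_core_web_md), and the centroid-distance summary is just one illustrative way to quantify "no visible bias"; the actual data and plotting step are assumptions on my part.

```python
# Sketch: compare two annotation samples in embedding space.
# Random vectors stand in for real document embeddings; PCA projects
# them to 2D so both samples could be plotted and eyeballed for bias.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-ins for ~1K manually sampled vs ~1K actively sampled documents.
manual_emb = rng.normal(size=(1000, 300))
active_emb = rng.normal(size=(1000, 300))

coords = PCA(n_components=2).fit_transform(np.vstack([manual_emb, active_emb]))
manual_2d, active_2d = coords[:1000], coords[1000:]

# A large gap between the two sample centroids in the projected space
# would hint at sampling bias; overlapping clouds would not.
gap = np.linalg.norm(manual_2d.mean(axis=0) - active_2d.mean(axis=0))
print(f"distance between sample centroids: {gap:.3f}")
```

In a real run you would scatter-plot `manual_2d` and `active_2d` in two colours rather than rely on a single summary number.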

I experimented both with binary active learning and with a custom recipe that used the model to surface examples while I annotated the whole segment with ORG entities, and both seem to suffer from this problem.

I wonder if you have any theories as to what might be causing this that I could explore?


You might enjoy this discussion on active learning:

In general, or at least in my personal experience, it feels safe to say that "active learning doesn't always work", but it's incredibly hard to tell upfront. In situations where one class is relatively rare, it does seem to cause improvements, mainly because you get some help in sampling the rare class. But in other scenarios you could argue that active learning can get stuck in a local optimum, depending on how everything is set up.

There are also some articles in this space that might be of interest, here's one TIL from my personal blog:

At the end of that TIL you'll also notice a benchmark with scikit-learn where "random sampling" beats an active learning strategy.
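A minimal benchmark in that spirit might look like the sketch below. The setup (synthetic balanced data, logistic regression, least-confidence uncertainty sampling, batch sizes) is my own assumption for illustration, not the setup from the TIL; the point is only the shape of the comparison, not which strategy wins.

```python
# Sketch: random sampling vs uncertainty sampling on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

def run(strategy, seed=0, n_start=20, n_rounds=10, batch=20):
    rng = np.random.default_rng(seed)
    labelled = list(rng.choice(len(X_pool), n_start, replace=False))
    for _ in range(n_rounds):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_pool[labelled], y_pool[labelled])
        rest = np.setdiff1d(np.arange(len(X_pool)), labelled)
        if strategy == "random":
            picks = rng.choice(rest, batch, replace=False)
        else:
            # Least confidence: query the points the model is least sure about.
            conf = clf.predict_proba(X_pool[rest]).max(axis=1)
            picks = rest[np.argsort(conf)[:batch]]
        labelled.extend(picks)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_pool[labelled], y_pool[labelled])
    return clf.score(X_test, y_test)

print("random     :", run("random"))
print("uncertainty:", run("uncertainty"))
```

Running this across several seeds and datasets is what it takes to say anything meaningful; a single run can go either way.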

Gut Feelings: Why?!

What I'm about to describe is a gut feeling based on personal experience, but when you wonder "why?!" you might want to consider a thought experiment.

Suppose that we have a balanced classification use-case (so no rare labels). We have a dataset X that we split up into X_valid and X_train. Let's also assume that X_valid is annotated (we have y_valid) without error and we're about to annotate X_train.

Then which strategy should we apply to annotate?

  1. We should be careful about introducing sampling bias. So the best thing we can do is just sample randomly. That way, the annotated subset of X_train should follow the same distribution as X_valid.
  2. We should do something else.

When you frame the problem this way, it suddenly sounds a bit strange to even introduce active learning, because it's introducing a bias of sorts.

When does it work?

I like to keep this argument in the back of my mind at all times, but I want to acknowledge that there is also evidence of situations where active learning does make a difference. I think the main thing is that there's no clear consensus on the circumstances needed for it to make a big impact. I don't read papers as much as I'd like, so somebody with more experience could correct me, but my understanding is that this is still an area of active research.

Hi Vincent,

Thanks for sharing your thoughts. Big fan of calmcode btw :+1:

I did read the prior discussion in "Active Learning: Does it work?" and I found it quite interesting and helpful. As it turns out, LightTag has become more accepting of active learning lately; see ALMa: Active Learning (data) Manager

I had the same hunch about the random sample being bias-free vs the active learning sample, but embedding and plotting the data does not reveal any visible bias. Let me know if you have any other ideas for how this hypothesis could be tested.

I did scan the literature for signs of active learning not working, and I came away with the impression that it mostly works, since not a lot of people or papers say otherwise. Or maybe they just don't say so in public :sweat_smile:


I'm currently working through this Manning book on active learning; it's a lucid introduction and might be helpful:

Not too math-y, but enough to get a handle on it.

Let me know if you have any other ideas how this hypothesis could be tested.

I'm actually working on some benchmarks! But it's in the pile together with other benchmarks, plugins, client work, open source work and ... I also recently became a dad. So no promises on a due date. :sweat_smile:

But I guess the simplest "evidence" would be to pick a dataset where active learning can get stuck in a local optimum.
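One way to construct such a dataset (an assumption on my part, not a published benchmark) is to put one class in two well-separated clusters. A model seeded only with points near the first cluster keeps querying near the decision boundary it already knows about, so uncertainty sampling may never visit the second cluster, while random sampling eventually will:

```python
# Sketch: a 2D dataset where class 1 lives in two separated clusters.
import numpy as np

rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(500, 2)),   # class 0
    rng.normal(loc=[2, 0], scale=0.5, size=(400, 2)),   # class 1, cluster A
    rng.normal(loc=[-8, 8], scale=0.5, size=(100, 2)),  # class 1, cluster B, far away
])
y = np.array([0] * 500 + [1] * 500)

# An uncertainty sampler trained on points near the (0,0)/(2,0) boundary
# is confident (and wrong) about cluster B, so it rarely queries there;
# the annotated set then under-represents 20% of class 1.
```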