I am using NER and wonder whether active learning affects how the accuracy figures should be interpreted.
My thoughts are the following:
- active learning only selects the most challenging examples
- the accuracy on the evaluation set might therefore be lower than if I used randomly drawn examples for evaluation
- that might mean that, e.g., a measured 65% is in reality 65+x%
Overall, the question is mostly theoretical: with print stream I can look at the results and I like what I see. However, I started wondering when, after going from 500 to 1000 examples, I only saw a minimal increase in accuracy (which might also simply be correct).
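To make the concern concrete, here is a toy simulation I put together. It is purely illustrative: there is no real NER model, and the pool, the difficulty scores, and the selection rule are all made up. It just shows how picking the "hardest" examples for evaluation can drag the measured accuracy below what a random sample would show.

```python
import random

random.seed(0)

# Toy pool: each example has a "difficulty" in [0, 1] that drives both
# how uncertain the model is about it and how likely the model is to get it wrong.
pool = [random.random() for _ in range(10_000)]

def model_is_correct(difficulty):
    # Assume the model is right more often on easy examples.
    return random.random() > difficulty

def accuracy(examples):
    return sum(model_is_correct(d) for d in examples) / len(examples)

random_sample = random.sample(pool, 500)
# "Active learning"-style selection: take the 500 hardest examples.
hardest_sample = sorted(pool, reverse=True)[:500]

print(f"accuracy on random sample:  {accuracy(random_sample):.2f}")   # roughly 0.5
print(f"accuracy on hardest sample: {accuracy(hardest_sample):.2f}")  # much lower
```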
Sorry for the delay getting back to you on this — it slipped through, so I’m only seeing this now.
The simple answer to your question is yes: active learning does select a biased sample, so for a reliable estimate of accuracy you should annotate a separate, held-out data set without using active learning to select the examples. Random splitting is a useful option at the start of a project as a quick-and-dirty measure of progress, but after the first day or two of work, I would suggest making a dedicated evaluation set.
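Just to sketch the workflow (this is not a Prodigy recipe; the file names, the eval size of 500 and the seed are placeholders): carve out the evaluation set with a single random split before any active-learning selection happens, annotate it exhaustively, and then keep it fixed while the rest of the data goes through the active-learning loop.

```python
import json
import random

def split_eval_set(examples, eval_size=500, seed=42):
    """Shuffle once and set aside a fixed, randomly drawn evaluation set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]  # (annotation pool, eval set)

# Hypothetical input file: one {"text": ...} record per line.
with open("raw_texts.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

pool, eval_set = split_eval_set(examples)

# Annotate eval_set exhaustively (no active learning) and keep it fixed;
# only feed `pool` into the active-learning loop, and always report
# accuracy against the annotated evaluation set.
with open("eval_raw.jsonl", "w", encoding="utf8") as f:
    for eg in eval_set:
        f.write(json.dumps(eg) + "\n")
```

Because the evaluation set is drawn randomly and never touched by the selection strategy, the accuracy you measure on it stays comparable as you add more training examples.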