Taking advantage of TONS of unlabeled data

mhigginslp · May 4, 2018, 3:32pm

With the current active learning setup, predictions are made in batches, and the active learning algorithm then decides which of those predictions will help the model learn most quickly (and keep the annotation process fast).

The problem is that after a relatively short period of time it is unlikely that the active learning algorithm will find helpful candidate annotations in each batch - it just finds stuff that it is already quite confident about.

Do you have any suggestions for getting the active learning system to look at a bigger sample of data or skip entire batches before presenting text to the annotator?

honnibal · May 7, 2018, 12:30pm

You might want to implement your own sorter method for this. The function just needs to take a sequence of (score, example) tuples and yield out a sequence of examples.

Another simple thing you could try is to use prefer_uncertain(stream, algorithm='probability'). The probability algorithm randomly drops examples proportional to their score. In contrast, the default sorter doesn’t trust that the scores produced by the model are going to stay well calibrated. It asks you about examples which are higher scoring than average, and tracks that average over time.

So, if you get a long sequence of low-scoring examples, the probability sorter will ask you fewer questions from them, while the exponential moving average sorter will get impatient, and start asking you questions even if the scores are low.
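
To make the contrast concrete, here is a rough sketch of both ideas as plain Python generators. This is not Prodigy's actual implementation: the function names, the `uncertainty` helper, the `alpha` and `start` values, and the assumption that scores are probabilities in [0, 1] are all made up for illustration.

```python
import random
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Example = Dict[str, Any]
ScoredStream = Iterable[Tuple[float, Example]]


def uncertainty(score: float) -> float:
    """Map a probability in [0, 1] to [0, 1], where 0.5 is most uncertain."""
    return 1.0 - abs(score - 0.5) * 2.0


def probability_sorter(
    scored_stream: ScoredStream,
    rng: Callable[[], float] = random.random,
) -> Iterator[Example]:
    """Keep each example with probability equal to its uncertainty.

    A long run of confident predictions therefore yields very few questions.
    """
    for score, example in scored_stream:
        if rng() < uncertainty(score):
            yield example


def moving_average_sorter(
    scored_stream: ScoredStream,
    alpha: float = 0.1,   # hypothetical smoothing factor
    start: float = 0.5,   # hypothetical initial bar
) -> Iterator[Example]:
    """Keep examples whose uncertainty is at least the running average.

    The bar follows the stream, so it drops after a run of confident
    predictions and examples that are uncertain only relative to their
    neighbours start getting through.
    """
    average = start
    for score, example in scored_stream:
        u = uncertainty(score)
        if u >= average:
            yield example
        # Exponential moving average of the uncertainty seen so far.
        average = alpha * u + (1.0 - alpha) * average
```

Either function can be dropped in wherever a sorter is expected, as long as the incoming stream really is a sequence of (score, example) tuples. With 50 very confident examples followed by one at 0.8, the moving average version lets the last one through even though its uncertainty (0.4) would fail a fixed 0.5 bar.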

akshitasood63 · April 3, 2019, 11:31am

Hi,
When using prefer_low_scores, the sorter will not stream the data with high scores, right? So will that removed data, or the predictions for that confident data, be saved somewhere or not?
If not, and the confident data is discarded, don’t you think the training data distribution might get biased and skewed away from the bigger classes? Is that a matter of concern for the models?

ines · April 3, 2019, 12:33pm

No, the filters like prefer_uncertain or prefer_high_scores decide what examples in your incoming data to send out. So everything that's not sent out doesn't get annotated and won't be saved in the database.

This is actually kind of the purpose of using a sorter like prefer_high_scores. In cases where you have a very imbalanced distribution, you might want to explicitly bias the example selection to achieve better results.
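
If you do want to hold on to whatever the sorter skips, one option is to wrap the scored stream yourself. A minimal sketch, assuming the stream yields (score, example) tuples as described earlier in the thread; `split_low_scores`, the `skipped` list and the 0.5 threshold are hypothetical and not part of Prodigy:

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Example = Dict[str, Any]


def split_low_scores(
    scored_stream: Iterable[Tuple[float, Example]],
    skipped: List[Example],
    threshold: float = 0.5,  # hypothetical cut-off
) -> Iterator[Example]:
    """Yield low-scoring examples and stash the rest, with their score.

    This is a generator, so `skipped` only fills up as the stream is consumed.
    """
    for score, example in scored_stream:
        if score < threshold:
            yield example
        else:
            # Keep a copy with the model's score so it can be inspected later.
            skipped.append({**example, "score": score})


skipped: List[Example] = []
scored = [(0.2, {"text": "a"}), (0.9, {"text": "b"})]
to_annotate = list(split_low_scores(scored, skipped))
```

After consuming the stream, `to_annotate` holds the examples that would be sent out and `skipped` holds the confident ones, which you could write to a file and check for class skew.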

akshitasood63 · April 3, 2019, 4:09pm

Great. Thanks a lot!!
