Taking advantage of TONS of unlabeled data

You might want to implement your own sorter for this. The function just needs to take a sequence of (score, example) tuples and yield a sequence of examples. Another simple thing to try is prefer_uncertain(stream, algorithm='probability'). The probability algorithm randomly drops examples, with probability proportional to their score. In contrast, the default sorter doesn't trust that the scores produced by the model will stay well calibrated. It asks you about examples that score higher than average, and tracks that average over time with an exponential moving average. So if you get a long sequence of low-scoring examples, the probability sorter will ask you about few of them, while the exponential moving average sorter will get impatient and start asking you questions even if the scores are low.
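To make the sorter contract concrete, here is a minimal sketch of both strategies. These are toy reimplementations of the behavior described above, not Prodigy's actual internals; the function names `probability_sorter` and `ema_sorter` and the `smoothing` parameter are invented for illustration.

```python
import random

def probability_sorter(scored_stream):
    """Toy probability sorter: keep each example with probability
    equal to its score, so low-scoring runs yield few questions."""
    for score, example in scored_stream:
        if random.random() < score:
            yield example

def ema_sorter(scored_stream, smoothing=0.5):
    """Toy exponential-moving-average sorter: yield examples that
    score at or above the running average. A long run of low scores
    drags the average down, so the sorter 'gets impatient' and
    starts asking about low-scoring examples too."""
    avg = None
    for score, example in scored_stream:
        if avg is None:
            avg = score            # initialize from the first score
        if score >= avg:
            yield example
        avg = avg + smoothing * (score - avg)  # update the moving average
```

For example, feeding `ema_sorter` a high score followed by a run of low ones shows the average decaying until a merely mediocre score gets through:

```python
stream = [(0.9, "a"), (0.2, "b"), (0.3, "c"), (0.25, "d"), (0.35, "e")]
print(list(ema_sorter(stream)))  # → ['a', 'e']
```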