Hi team,
According to the documentation, Prodigy subsamples from the unlabeled data pool (loaded from JSON), then scores the examples and selects the uncertain ones for annotation.
I understand that it subsamples because the pool could be huge, e.g. 10 million documents. However, if the pool is not that big, this subsampling strategy may only ever leverage part of the information in the unlabeled data, even if it keeps sampling.
In our use case, we only have ~2,000 unlabeled examples in the pool, and it's acceptable to let annotators wait a short while for model re-training. So, can we re-score all 2,000 during the update, rather than sampling from the 2,000 first and then scoring? Thanks.
Could you use one of the “correct” recipes and retrain after an arbitrary n samples? That’s not active learning, but it sounds like what you want to do.
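To make that concrete, here's a rough sketch of what such a loop could look like when driven from a small script. The dataset name, model paths, source file and labels below are placeholders, and the command shapes follow recent Prodigy versions:

```python
# Hypothetical "annotate a batch, then retrain" loop. Dataset name, model
# paths, source file and labels are placeholders -- adjust to your project.
import subprocess

DATASET = "textcat_demo"     # Prodigy dataset that stores the annotations
MODEL = "./model-latest"     # current trained pipeline with a textcat component

# 1) Annotate with a *.correct recipe. This starts the Prodigy server and
#    blocks; stop it once you've annotated roughly the n examples you want.
subprocess.run([
    "python", "-m", "prodigy", "textcat.correct", DATASET, MODEL,
    "pool.jsonl", "--label", "POSITIVE,NEGATIVE",
])

# 2) Retrain on everything collected so far and write out a fresh pipeline,
#    which the next annotation round can then load.
subprocess.run([
    "python", "-m", "prodigy", "train", "./model-new", "--textcat", DATASET,
])
```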
Thanks for the prompt reply. We do use textcat.teach to train the model, and it has worked well so far. The question is: does it generate uncertain samples for annotation by scoring all of the unlabeled data? Thanks.
Plus, we currently run textcat.teach manually. Are you suggesting we write a script to launch the training automatically and periodically? That might work.
The textcat.teach recipe indeed currently uses the prefer_uncertain function internally. It's documented here.
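If you do want to score the whole pool rather than a subsample, here's a minimal sketch of feeding prefer_uncertain a stream over all examples. The model path, label name and file name are assumptions, and the scoring just uses a plain spaCy pipeline with a textcat component:

```python
# Minimal sketch: score the *whole* pool, then let prefer_uncertain prioritise
# examples whose scores are close to 0.5. Paths and the "POSITIVE" label are
# placeholders.
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

nlp = spacy.load("./model-latest")   # trained pipeline with a textcat component
stream = JSONL("pool.jsonl")         # the ~2,000 unlabeled examples

def score_all(examples):
    """Yield (score, example) tuples over the full pool, with no subsampling."""
    for eg in examples:
        doc = nlp(eg["text"])
        yield doc.cats.get("POSITIVE", 0.0), eg

# prefer_uncertain consumes (score, example) tuples and yields the examples,
# preferring scores near 0.5, so the annotator sees the uncertain ones first.
for eg in prefer_uncertain(score_all(stream)):
    ...  # e.g. return this generator as the stream of a custom recipe
```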
You could retrain the model once in a while, but if you prefer to have more control, you can also just prepare your examples.jsonl file upfront in a Jupyter notebook. That way, you can use whatever trick you like to subset your dataset and prioritise the examples however you see fit.
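As a sketch of that upfront approach (file names, model path and label are again placeholders): score all ~2,000 texts with the current model, sort by uncertainty, and write the result to the JSONL file you annotate from.

```python
# Score the full pool once, order it by uncertainty (scores closest to 0.5
# first) and write a prioritised examples.jsonl. All names are placeholders.
import spacy
import srsly

nlp = spacy.load("./model-latest")               # current textcat pipeline
examples = list(srsly.read_jsonl("pool.jsonl"))  # ~2,000 {"text": ...} records

texts = [eg["text"] for eg in examples]
for eg, doc in zip(examples, nlp.pipe(texts)):
    # Stash the score in "meta" so Prodigy shows it in the UI.
    eg["meta"] = {"score": doc.cats.get("POSITIVE", 0.0)}

examples.sort(key=lambda eg: abs(eg["meta"]["score"] - 0.5))
srsly.write_jsonl("examples.jsonl", examples)
```

You can then feed that examples.jsonl to a manual or correct recipe, which will serve the examples in that order.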
A somewhat extreme, albeit very useful, example of this approach is shown in this recent video on bulk labelling.