Hi team,
According to the documentation, Prodigy subsamples from the unlabeled data pool (loaded from JSON), then scores the examples and selects the uncertain ones for annotation.
I understand that it subsamples because the pool could be huge, e.g. 10 million documents. However, if the pool is not that big, this subsampling strategy may only ever leverage part of the information in the unlabeled data, even if it keeps sampling.
In our use case, we only have ~2,000 unlabeled examples in the pool, and it's acceptable to let annotators wait a short while for model re-training. So, can we re-score all 2,000 during the update, rather than sampling from the 2,000 first and then scoring? Thanks.
Could you use one of the “correct” recipes and retrain after an arbitrary n samples? That’s not active learning, but it sounds like what you want to do.
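To make that concrete, here's a rough sketch of what such a loop could look like when driven from a small script. The dataset name, model paths, source file and labels below are placeholders, and the command shapes follow recent Prodigy versions:

```python
# Hypothetical "annotate a batch, then retrain" loop. Dataset name, model
# paths, source file and labels are placeholders -- adjust to your project.
import subprocess

DATASET = "textcat_demo"     # Prodigy dataset that stores the annotations
MODEL = "./model-latest"     # current trained pipeline with a textcat component

# 1) Annotate with a *.correct recipe. This starts the Prodigy server and
#    blocks; stop it once you've annotated roughly the n examples you want.
subprocess.run([
    "python", "-m", "prodigy", "textcat.correct", DATASET, MODEL,
    "pool.jsonl", "--label", "POSITIVE,NEGATIVE",
])

# 2) Retrain on everything collected so far and write out a fresh pipeline,
#    which the next annotation round can then load.
subprocess.run([
    "python", "-m", "prodigy", "train", "./model-new", "--textcat", DATASET,
])
```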
Thanks for the prompt reply. We do use textcat.teach to train the model, and it has worked well so far. The question is: does it generate uncertain samples for annotation by scoring all of the unlabeled data? Thanks.
Plus, we currently run textcat.teach manually. Are you suggesting we write a script to launch the training automatically and periodically? That might work.
The textcat.teach recipe indeed currently uses the prefer_uncertain function internally. It's documented here.
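If you do want to score the whole pool rather than a subsample, here's a minimal sketch of feeding prefer_uncertain a stream over all examples. The model path, label name and file name are assumptions, and the scoring just uses a plain spaCy pipeline with a textcat component:

```python
# Minimal sketch: score the *whole* pool, then let prefer_uncertain prioritise
# examples whose scores are close to 0.5. Paths and the "POSITIVE" label are
# placeholders.
import spacy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

nlp = spacy.load("./model-latest")   # trained pipeline with a textcat component
stream = JSONL("pool.jsonl")         # the ~2,000 unlabeled examples

def score_all(examples):
    """Yield (score, example) tuples over the full pool, with no subsampling."""
    for eg in examples:
        doc = nlp(eg["text"])
        yield doc.cats.get("POSITIVE", 0.0), eg

# prefer_uncertain consumes (score, example) tuples and yields the examples,
# preferring scores near 0.5, so the annotator sees the uncertain ones first.
for eg in prefer_uncertain(score_all(stream)):
    ...  # e.g. return this generator as the stream of a custom recipe
```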
You could retrain the model once in a while, but if you prefer to have more control, you can also just prepare your examples.jsonl file upfront in a Jupyter notebook. That way, you can use whatever trick you like to subset your dataset and prioritise the examples however you see fit.
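As a sketch of that upfront approach (file names, model path and label are again placeholders): score all ~2,000 texts with the current model, sort by uncertainty, and write the result to the JSONL file you annotate from.

```python
# Score the full pool once, order it by uncertainty (scores closest to 0.5
# first) and write a prioritised examples.jsonl. All names are placeholders.
import spacy
import srsly

nlp = spacy.load("./model-latest")               # current textcat pipeline
examples = list(srsly.read_jsonl("pool.jsonl"))  # ~2,000 {"text": ...} records

texts = [eg["text"] for eg in examples]
for eg, doc in zip(examples, nlp.pipe(texts)):
    # Stash the score in "meta" so Prodigy shows it in the UI.
    eg["meta"] = {"score": doc.cats.get("POSITIVE", 0.0)}

examples.sort(key=lambda eg: abs(eg["meta"]["score"] - 0.5))
srsly.write_jsonl("examples.jsonl", examples)
```

You can then feed that examples.jsonl to a manual or correct recipe, which will serve the examples in that order.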
A somewhat extreme, albeit very useful, example of this approach is shown in this recent video on bulk labelling.