I have been trying to understand how active learning works inside a teach recipe, specifically for the text classification case: `textcat.teach`. A couple of questions around it:
- Does the line `stream = prefer_uncertain(model(stream))` (located at https://github.com/explosion/prodigy-recipes/blob/0037b32d954e0b1672f9dae1e8aa53ac0c9136e3/textcat/textcat_custom_model.py#L63) score and re-sort ALL samples in an input file (e.g. JSONL)? Or does it score and re-sort only a `batch_size` number of samples from those already annotated?
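For context, here is a toy sketch of my current mental model of what `prefer_uncertain` does: it consumes a stream of `(score, example)` tuples produced by the model and yields only the examples whose scores sit near the decision boundary. This is NOT Prodigy's actual implementation, just an illustration of the behaviour I am asking about; `toy_prefer_uncertain` and `threshold` are names I made up.

```python
def toy_prefer_uncertain(scored_stream, threshold=0.15):
    """Yield examples whose score is close to 0.5 (i.e. most uncertain).

    Toy stand-in for my understanding of prefer_uncertain; the real
    sorter in Prodigy is more sophisticated (e.g. it adapts over time).
    """
    for score, example in scored_stream:
        # Distance from the decision boundary; small distance = uncertain.
        if abs(score - 0.5) < threshold:
            yield example

# Fake "scored stream": in the recipe this would come from model(stream).
scored = [
    (0.95, {"text": "clearly positive"}),
    (0.52, {"text": "ambiguous"}),
    (0.05, {"text": "clearly negative"}),
    (0.45, {"text": "also ambiguous"}),
]

kept = list(toy_prefer_uncertain(iter(scored)))
# Only the two ambiguous examples survive the filter.
```

My question above is essentially whether the real sorter applies this kind of filtering over the whole file up front, or lazily over the stream in batches.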
- For a highly imbalanced dataset (the majority class being 0 in a binary classification task), is it better to use `prefer_uncertain` to construct a more balanced dataset?