Hi,
I have been trying to understand how active learning works in a teach recipe, specifically for the text classification case: `textcat.teach`. I have a couple of questions about it:
- Does the line `stream = prefer_uncertain(model(stream))` (located at https://github.com/explosion/prodigy-recipes/blob/0037b32d954e0b1672f9dae1e8aa53ac0c9136e3/textcat/textcat_custom_model.py#L63) score and re-sort ALL samples in an input file (e.g. JSONL)? Or does it score and re-sort only `batch_size` samples from the already annotated samples?
- For a highly imbalanced dataset (majority class being 0 in a binary classification task), is it better to use `prefer_high_scores` instead of `prefer_uncertain` to construct a more balanced dataset?
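
To make the first question concrete, here is a toy sketch of the two behaviours I have in mind. This is not Prodigy's code: `score_stream`, `resort_everything`, `resort_per_batch` and the hard-coded `score` field are all hypothetical stand-ins I made up for illustration, and I am assuming "uncertain" means "score closest to 0.5".

```python
from itertools import islice


def score_stream(stream):
    """Hypothetical stand-in for model(stream): lazily yields (score, example)."""
    for eg in stream:
        yield eg["score"], eg


def uncertainty(pair):
    """Assumed definition: distance of the score from 0.5 (smaller = more uncertain)."""
    score, _ = pair
    return abs(score - 0.5)


def resort_everything(scored):
    """Behaviour A: consume the whole input, then sort all of it by uncertainty."""
    return [eg for _, eg in sorted(scored, key=uncertainty)]


def resort_per_batch(scored, batch_size):
    """Behaviour B: only ever look at batch_size examples at a time."""
    scored = iter(scored)
    while True:
        batch = list(islice(scored, batch_size))
        if not batch:
            return
        for _, eg in sorted(batch, key=uncertainty):
            yield eg


# Made-up examples with made-up scores, just to show the difference in ordering.
examples = [
    {"text": "a", "score": 0.9},
    {"text": "b", "score": 0.2},
    {"text": "c", "score": 0.55},
    {"text": "d", "score": 0.5},
]

all_sorted = [eg["text"] for eg in resort_everything(score_stream(examples))]
batch_sorted = [eg["text"] for eg in resort_per_batch(score_stream(examples), 2)]
print(all_sorted)    # ['d', 'c', 'b', 'a']
print(batch_sorted)  # ['b', 'a', 'd', 'c']
```

With behaviour A the most uncertain example in the whole file comes first; with behaviour B the order depends on where the batch boundaries fall. I would like to know which of these (if either) is closer to what the sorter actually does.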
Thanks!