Scoring and sorting all samples during textcat teach


I have been trying to understand how active learning works in a teach recipe, specifically for the text classification case: textcat.teach. I have a couple of questions about it:

  1. Does the line stream = prefer_uncertain(model(stream)) in the textcat.teach recipe score and re-sort ALL samples in an input file (e.g. JSONL)? Or does it score and re-sort only batch_size samples from the already annotated samples?
  2. For a highly imbalanced dataset (the majority class being 0 in a binary classification task), is it better to use prefer_high_scores instead of prefer_uncertain to construct a more balanced dataset?
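
To make the two readings in question 1 concrete, here is a toy sketch in plain Python. It is not Prodigy's implementation and does not call Prodigy at all; the helper names (uncertainty, sort_all, sort_per_batch), the scores, and the batch size are all made up for illustration. It assumes the model yields (score, example) pairs and that "uncertain" means a score close to 0.5.

```python
from itertools import islice


def uncertainty(score):
    # Hypothetical measure: 1.0 when score == 0.5, 0.0 when score is 0 or 1.
    return 1.0 - 2.0 * abs(score - 0.5)


def sort_all(scored):
    """Reading 1: score the whole file first, then sort everything at once."""
    ranked = sorted(scored, key=lambda pair: -uncertainty(pair[0]))
    return [example for _, example in ranked]


def sort_per_batch(scored, batch_size):
    """Reading 2: consume the stream lazily, re-sorting only within each batch."""
    scored = iter(scored)
    while True:
        batch = list(islice(scored, batch_size))
        if not batch:
            return
        ranked = sorted(batch, key=lambda pair: -uncertainty(pair[0]))
        for _, example in ranked:
            yield example


# Made-up (score, example) pairs standing in for model(stream).
scored_stream = [(0.95, "a"), (0.50, "b"), (0.10, "c"), (0.45, "d")]

global_order = sort_all(scored_stream)
batch_order = list(sort_per_batch(scored_stream, batch_size=2))

print(global_order)  # ['b', 'd', 'c', 'a']
print(batch_order)   # ['b', 'a', 'd', 'c']
```

In the first ordering the near-certain example "a" is pushed to the very end of the file; in the second it is still shown early because it only competes with the other example in its batch. My question is which of these (if either) is closer to what the recipe actually does.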


A post was merged into an existing topic: Prodigy Active Learning prefer_uncertain mechanism