The idea of the `textcat.teach` recipe is that it uses the model in the loop to select the most relevant examples for annotation, based on the score (e.g. prioritising the examples with a score closest to 0.5, as those may be the most "uncertain" predictions). This also means that the recipe will skip examples with high and low scores, so you're not going to see all examples in your dataset. The recipe uses an exponential moving average to decide which scores to consider. This prevents Prodigy from getting stuck if the model ends up in a state where it mostly produces very high or very low scores.
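To give a rough idea of how this kind of selection can work, here's a minimal sketch of uncertainty-based filtering with an exponential moving average. It's for illustration only and isn't Prodigy's actual implementation; the `stream` of `(example, score)` pairs and the `alpha` smoothing factor are assumptions.

```python
# Rough sketch of uncertainty-based selection with an exponential moving
# average -- for illustration only, not Prodigy's actual implementation.

def prefer_uncertain(stream, alpha=0.1):
    """Yield examples whose predictions are relatively uncertain.

    stream: iterable of (example, score) pairs, where score is the model's
            probability for the label (0.0 to 1.0).
    alpha:  smoothing factor for the exponential moving average.
    """
    ema = 0.5  # moving average of recent uncertainty values
    for example, score in stream:
        # 1.0 means maximally uncertain (score of 0.5), 0.0 means confident
        uncertainty = 1.0 - 2.0 * abs(score - 0.5)
        if uncertainty >= ema:
            yield example, score
        # Update the average so the filter adapts: if the model only produces
        # confident scores for a while, the bar drops and examples still get
        # through instead of the stream getting stuck.
        ema = alpha * uncertainty + (1 - alpha) * ema
```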
If you're starting completely from scratch with a new model and you're annotating labels that might not be equally distributed, this workflow can be less effective because your model knows nothing yet, and it would take a very long time to collect enough examples of all labels to teach it something meaningful so it can actually "participate" properly.
So it might make sense to start with a manual workflow like `textcat.manual` and annotate a small sample from scratch. You can then pretrain your model on that to give it a head start. It can also help to use `--patterns` on `textcat.teach` to make sure that pattern matches are always shown if they occur (e.g. to show examples that may be part of rarer classes).
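For illustration, here's roughly what a small patterns file for rarer classes could look like. The labels and phrases are made up, and the format follows the usual convention of a `"label"` plus a `"pattern"` that's either a phrase string or a list of token attributes:

```python
# Hypothetical patterns for rarer classes -- labels and phrases are made up
# for illustration. Each line of the JSONL file is one pattern entry.
import json

patterns = [
    {"label": "REFUND_REQUEST", "pattern": [{"lower": "money"}, {"lower": "back"}]},
    {"label": "REFUND_REQUEST", "pattern": "refund"},
    {"label": "COMPLAINT", "pattern": [{"lower": "not"}, {"lower": "happy"}]},
]

with open("textcat_patterns.jsonl", "w", encoding="utf8") as f:
    for pattern in patterns:
        f.write(json.dumps(pattern) + "\n")
```

You'd then pass that file to the recipe via `--patterns textcat_patterns.jsonl`, so that texts matching those phrases are always shown for annotation even if the model's scores for them are still very low.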