Well, it depends – if your labels are mututally exclusive, there would be a dependency between them later on when you train your model. By default, textcat.teach
doesn't treat them as exclusive. (But if they are exclusive, that's probably something you should exploit, because then the presence/absence of one label means a lot and you can take advantage of that during annotation.)
I think my main concern was that in order to see an impact from the active learning, you need a decent amount of updates per label. And you need to scan a decent amount of data to find enough positive examples. And the categories are likely unevenly distributed.
So annotating all categories at once probably wouldn't be efficient anyways, because the model suggestions you'd see and that were selected based on the scores would probably skew towards what the model happened to get the most updates for. Or they'd stay completely random all of the time, because all categories start off being just as likely, and this never really changes, because there are not enough updates.
So even though there's no direct relationship between the labels from the model's prespective, what you see and don't see can still be influenced by what other labels are present with certain scores that get selected (e.g. uncertain scores) over other labels (e.g. high/low scores).
One thing you could do that still gives you control over the model state on the back-end is to script a custom version of textcat.teach
that initialises the model with all labels and then keeps iterating over the stream, filtering by label until some criteria are met. For example, minimum of 100 annotations and/or low loss (return value of the update
callback).
Still, I think the "cleanest" solution would be to start a separate instance for each label on a separate port with its own model instance in the loop. But that's not really feasible if you actually need that many labels at the same time.
Is there no way you can break your categories down into a few top-level labels and annotate those first? For example, the first and very important distinction could be something like, is this SPORTS
or POLITICS
? That's straightforward to annotate and to train. And that's also what you want to evaluate and run experiments with first. If your model cannot learn this, it likely won't learn any more fine-grained distinctions, either, so any investment in that direction would be a waste until you've solved the underlying problem.
And once you have the top-level classifier, you can train separate classifiers that only run on the pre-classified texts, which again can make it easier for the model to learn, because there are fewer options. For instance, given a SPORTS
text, is it about SPORTS_FOOTBALL
or SPORTS_BASKETBALL
?