Multi-label text classification with many labels

I am working on a multi-label text classification task with many categories (several hundred). So far I have set up a Prodigy task with the textcat.teach recipe and a patterns file.

Now, I have two options:

  1. Starting the task with all categories using the --label option => Annotation gets slow, because you have to adjust to a new category for each example

  2. Starting the task with one category only => Annotation is easy, but if you want to switch to a different category after, say, 100 examples, you have to restart the task. This is not an option, because the people who are willing to help with annotation are not used to working on the terminal.

Is there a solution for this issue?
E.g. switching between categories manually in the UI, or switching between categories automatically after a certain number of examples.

Hi! If you have that many (top-level) categories, are you sure you want to use a workflow like textcat.teach with the model in the loop? I'm not sure you'd really get a benefit here, because it's just too many categories that all (potentially) interact with each other. So you might not get any meaningful input from the model here and it makes things much harder because now you also have to deal with the model state.

So I think a more manual approach, maybe in combination with pattern matches might work better, and it also gives you more control over how you queue up the data. If your label scheme is hierarchical, that'd be helpful, too, because then you could start with the top level categories, have your annotators assign the rough categories and then do the more fine-grained distinctions. (Also see this section on working with large label sets for text classification.)

About switching categories in the UI: Prodigy tries to avoid this, because it's typically a bad idea to have your annotators select or manipulate labels – at least, if the goal is machine learning. The labels are what your model is going to predict, so those should be defined during development, and any change here can potentially have a big impact.

Hi Ines,
thank you for your quick reply!

I think I was not precise enough with my formulation: I want to give annotators the possibility to select out of a list of predefined labels for which label they want to do annotations.

I started out with the insults classification example with active learning; the only modification was that I used multiple labels. Can you please elaborate a bit more on why active learning is not a good idea in the case of many labels? I assumed that the spaCy model in the backend treats every label as independent. Does the active learning implementation create a dependency between the labels? If so, I'd absolutely agree that it is a bad idea and switch to textcat.manual.

Well, it depends – if your labels are mutually exclusive, there would be a dependency between them later on when you train your model. By default, textcat.teach doesn't treat them as exclusive. (But if they are exclusive, that's probably something you should exploit, because then the presence/absence of one label means a lot and you can take advantage of that during annotation.)

I think my main concern was that in order to see an impact from the active learning, you need a decent amount of updates per label. And you need to scan a decent amount of data to find enough positive examples. And the categories are likely unevenly distributed.

So annotating all categories at once probably wouldn't be efficient anyways, because the model suggestions you'd see and that were selected based on the scores would probably skew towards what the model happened to get the most updates for. Or they'd stay completely random all of the time, because all categories start off being just as likely, and this never really changes, because there are not enough updates.

So even though there's no direct relationship between the labels from the model's perspective, what you see and don't see can still be influenced by what other labels are present with certain scores that get selected (e.g. uncertain scores) over other labels (e.g. high/low scores).

One thing you could do that still gives you control over the model state on the back-end is to script a custom version of textcat.teach that initialises the model with all labels and then keeps iterating over the stream, filtering by label until some criteria are met. For example, minimum of 100 annotations and/or low loss (return value of the update callback).
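The label-switching part of that custom recipe could look something like the sketch below – this is just the queuing logic, with a plain generator standing in for the Prodigy stream and a simple annotation-count criterion (the "low loss" criterion and the actual recipe wiring are left out):

```python
from itertools import cycle

def queue_by_label(stream, labels, min_annotations=100):
    """Yield examples for one label at a time; switch to the next
    label once the current one has received enough annotations.
    A stand-in for the filtering you'd do in a custom textcat.teach
    recipe (examples for other labels are simply skipped here)."""
    counts = {label: 0 for label in labels}
    labels_iter = cycle(labels)
    current = next(labels_iter)
    for example in stream:
        if all(c >= min_annotations for c in counts.values()):
            return  # every label has met the criterion
        if counts[current] >= min_annotations:
            current = next(labels_iter)  # move on to the next label
        if example.get("label") == current:
            counts[current] += 1
            yield example
```

In a real recipe you'd plug this in around the stream returned by the sorter, and you could replace the count check with whatever criterion you return from the update callback.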

Still, I think the "cleanest" solution would be to start a separate instance for each label on a separate port with its own model instance in the loop. But that's not really feasible if you actually need that many labels at the same time.

Is there no way you can break your categories down into a few top-level labels and annotate those first? For example, the first and very important distinction could be something like, is this SPORTS or POLITICS? That's straightforward to annotate and to train. And that's also what you want to evaluate and run experiments with first. If your model cannot learn this, it likely won't learn any more fine-grained distinctions, either, so any investment in that direction would be a waste until you've solved the underlying problem.

And once you have the top-level classifier, you can train separate classifiers that only run on the pre-classified texts, which again can make it easier for the model to learn, because there are fewer options. For instance, given a SPORTS text, is it about SPORTS_FOOTBALL or SPORTS_BASKETBALL?
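If the fine-grained labels encode their parent category in the name, the top-level annotations can even be derived mechanically – a tiny sketch, using the example label names from above (the naming convention is an assumption, not a Prodigy requirement):

```python
def top_level(label):
    """Derive the top-level category from a fine-grained label,
    assuming names like SPORTS_FOOTBALL or SPORTS_BASKETBALL."""
    return label.split("_", 1)[0]

top_level("SPORTS_FOOTBALL")  # "SPORTS"
```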

Hi @ines

We have a multi-label text classification problem where some texts don't fit into any of the given categories. We want our model to train not only on the data that has labels, but also on the data that has no labels (doesn't fit into any category).
While annotating, what should I do with such texts so that they remain in the training set and the model learns that they don't fall under any label? Should I accept, reject or ignore them? Or should I create a new label for such texts? The latter option would lead to an imbalanced data set, though.
Please advise

If your categories are not mutually exclusive (multiple labels can apply, or no labels), you would want your model to predict 0 (or very low scores) for all categories, to indicate that the text doesn't fit into any category. So you could annotate those by not checking any boxes.
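For instance, when you later convert the annotations for training, an example with no boxes checked just becomes 0.0 for every label. A minimal sketch – the `LABELS` list and the `"accept"` key are assumptions based on Prodigy's choice-style annotations, where the selected options end up in an `"accept"` list:

```python
LABELS = ["SPORTS", "POLITICS", "ECONOMY"]  # hypothetical label scheme

def make_cats(eg):
    """Map a multiple-choice annotation to a spaCy-style cats dict.
    An example with no boxes checked gets 0.0 for every label."""
    accepted = set(eg.get("accept", []))
    return {label: 1.0 if label in accepted else 0.0 for label in LABELS}

make_cats({"text": "...", "accept": []})
# -> every label 0.0: the "fits no category" case
```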

If your categories are mutually exclusive (only one label can apply), you could have an OTHER category that should be predicted for all examples that do not fit into any category.

If your data is imbalanced in this way and many examples don't fit into any category, you could also experiment with two chained classifiers: first, train a classifier to predict whether the example is "relevant" (e.g. part of any category) and in the second step, only analyse the examples selected by the "relevant" classifier and predict one or more categories for it.
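The chaining itself is simple to wire up. Here's a sketch with stub predict functions (the names and placeholder logic are hypothetical – in practice these would be two trained text classifiers):

```python
def predict_relevant(text):
    """Stub for the first classifier: does this text belong to any category?"""
    return bool(text.strip())  # placeholder decision

def predict_categories(text):
    """Stub for the second classifier, only ever run on relevant texts."""
    return {"SPORTS": 0.9, "POLITICS": 0.1}  # placeholder scores

def classify(text):
    if not predict_relevant(text):
        return {}  # irrelevant: no categories at all
    return predict_categories(text)
```

The nice side effect is that the second classifier only ever sees in-domain examples, so the class imbalance caused by the "no category" texts never reaches it.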

Thanks for your prompt reply, @ines. It's a multi-label problem, i.e. the labels are mutually exclusive. So, should I accept without selecting any label when I find an example that doesn't fit into any category?

Yes, during training, that would be interpreted as 0.0 for each category, which is the result you want.