Multilabel text classification with more than 200 labels

I have 227 unique labels in the dataset, and each data point has more than one relevant label. To avoid underfitting, I need to increase my labeled data with data augmentation. How can Prodigy help me speed up the annotation process?

How are your labels structured? Are they hierarchical? If you're annotating with this many labels, we'd usually recommend breaking up the task and starting with the top-level categories, since those are usually the most important. If you have to think about 200+ decisions at every step, this will slow down the process a lot, and you'll probably end up with a lot of categories that are underrepresented (or not represented at all) in the data, which is also going to be difficult to fix with augmentation alone.

So if your categories are hierarchical, one approach would be to start with the top level, annotate those categories and run a first training experiment. You can then drill down into the individual categories and only select from the sub-labels if you know that the top-level category applies. This gives you fewer options to select from and makes annotation a lot faster. If your model trained on the top-level categories is good, you can even use it to do the top-level selection for you later in the process.
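To make the two-pass idea a bit more concrete, here's a minimal sketch in plain Python. It assumes multiple-choice tasks shaped as dicts with a "text" and a list of "options", where the selected option IDs come back under "accept". The `HIERARCHY` mapping, the label names, the extra "parent" key and both helper functions are hypothetical and only there for illustration, so you'd need to adapt them to your own label scheme and workflow.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical two-level hierarchy; the real label names come from your data.
HIERARCHY: Dict[str, List[str]] = {
    "SPORTS": ["FOOTBALL", "TENNIS"],
    "FINANCE": ["BANKING", "INSURANCE", "MARKETS"],
}


def make_options(labels: Iterable[str]) -> List[dict]:
    """Turn label names into multiple-choice options."""
    return [{"id": label, "text": label} for label in labels]


def top_level_tasks(texts: Iterable[str]) -> Iterator[dict]:
    """First pass: offer only the top-level categories for each text."""
    for text in texts:
        yield {"text": text, "options": make_options(HIERARCHY)}


def sub_label_tasks(annotated: Iterable[dict]) -> Iterator[dict]:
    """Second pass: one task per accepted top-level category,
    offering only the sub-labels of that category."""
    for eg in annotated:
        for parent in eg.get("accept", []):
            children = HIERARCHY.get(parent, [])
            if children:
                yield {
                    "text": eg["text"],
                    "parent": parent,  # hypothetical extra key, for bookkeeping
                    "options": make_options(children),
                }
```

With this setup, a text that was marked as FINANCE in the first pass only shows the three FINANCE sub-labels in the second pass, instead of the full label set. Later on, the "accept" lists for the first pass could come from a trained top-level model rather than from manual annotation.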

Here's an example of the UI you could put together for this multi-step process: Text Classification · Prodigy · An annotation tool for AI, Machine Learning & NLP

I've also shared some thoughts on textcat annotation with large label sets here: