Multilabel text classification with more than 200 labels

sanvgupta · January 19, 2022, 11:13am

I have 227 unique labels in the dataset, and each data point has more than one relevant label. To avoid underfitting, I need to increase my labeled data with data augmentation. How can Prodigy help me speed up the annotation process?

ines · January 19, 2022, 5:49pm

How are your labels structured? Are they hierarchical? If you're annotating with this many labels, we'd usually recommend breaking up the task and starting with the top-level categories, since those are usually the most important. If you have to think about 200+ decisions at every step, this will slow down the process a lot, and you'll probably end up with a lot of categories that are underrepresented (or not represented at all) in the data, which is also going to be difficult to fix with augmentation alone.

So if your categories are hierarchical, one approach would be to start with the top level, annotate those and run a first training experiment. You can then drill down into the individual categories and only select from the sub-labels if you know that the top level applies. This gives you fewer options to select from and makes annotation a lot faster. If your model trained on the top level categories is good, you can even use it to do the top-level selection for you later in the process.
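As a rough sketch of that two-pass idea, here's how the tasks for each pass could be generated in plain Python. The label names in `HIERARCHY` and the helper functions `top_level_tasks` and `sub_label_tasks` are made up for illustration and aren't part of Prodigy. The only thing borrowed from Prodigy is the task shape of the `choice` interface: each task has a `"text"` and a list of `"options"`, and the selected option IDs are stored under `"accept"`.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical two-level label scheme; the real 227 labels would go here.
HIERARCHY: Dict[str, List[str]] = {
    "SPORTS": ["FOOTBALL", "TENNIS", "CYCLING"],
    "FINANCE": ["BANKING", "INSURANCE"],
}


def top_level_tasks(texts: Iterable[str]) -> Iterator[dict]:
    """First pass: offer only the top-level categories as options."""
    options = [{"id": label, "text": label} for label in HIERARCHY]
    for text in texts:
        yield {"text": text, "options": options}


def sub_label_tasks(annotated: Iterable[dict]) -> Iterator[dict]:
    """Second pass: one task per selected top-level category,
    offering only the sub-labels of that category."""
    for eg in annotated:
        # Skip examples that were rejected or ignored in the first pass.
        if eg.get("answer", "accept") != "accept":
            continue
        for top in eg.get("accept", []):
            sub_labels = HIERARCHY.get(top, [])
            if not sub_labels:
                continue
            yield {
                "text": eg["text"],
                "top_label": top,  # kept so the annotator sees the context
                "options": [{"id": s, "text": s} for s in sub_labels],
            }


if __name__ == "__main__":
    first_pass = [
        {"text": "Bank raises interest rates", "accept": ["FINANCE"], "answer": "accept"},
    ]
    for task in sub_label_tasks(first_pass):
        print(task["top_label"], [o["id"] for o in task["options"]])
```

With a scheme like this, the annotator only ever chooses between a handful of options per screen instead of all 200+. If a trained top-level model is available later, its predictions could stand in for the `"accept"` values of the first pass.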

Here's an example of the UI you could put together for this multi-step process: Text Classification · Prodigy · An annotation tool for AI, Machine Learning & NLP

I've also shared some thoughts on textcat annotation with large label sets here:
