How much training data for multiclass/multilabel text classification?

Good day,

We tried training our textcat_multilabel model (8 non-exclusive classes) in Prodigy and got a low score of 0.41. We currently have 1,000 examples, split 80/20 into training/evaluation. Is there a way for us to know how much more data we need to get a much higher accuracy/score? Or is it more trial and error?


hi @joebuckle!

Yes! Have you tried the train-curve recipe?

This recipe is designed to test how accuracy improves with more annotated data. The key is to look at the shape of the training curve. If your curve is still increasing near the end (e.g., over the last 25%), this indicates you may get incremental value (information) by labeling more. However, if the curve is starting to "level off", this can indicate diminishing marginal returns. Said differently, there isn't much value in labeling more. To improve your model at that point, you may need to rethink your annotation scheme (e.g., change your class definitions).
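To make the "shape of the curve" idea concrete, here's a minimal sketch of the heuristic in plain Python. The accuracy numbers are made up for illustration (only the 0.41 endpoint echoes the thread); the 0.01 threshold is an arbitrary assumption, not anything Prodigy computes for you.

```python
# Hypothetical train-curve results: (fraction of training data, accuracy).
# These numbers are invented for illustration.
curve = [(0.25, 0.28), (0.50, 0.35), (0.75, 0.39), (1.00, 0.41)]

# Improvement over the last segment (from 75% to 100% of the data).
last_gain = curve[-1][1] - curve[-2][1]

# A still-rising curve suggests more annotations are likely to help;
# a gain near zero suggests diminishing returns.
if last_gain > 0.01:
    print("Curve still rising: more annotations likely to help")
else:
    print("Curve leveling off: consider rethinking the annotation scheme")
```

Here the last segment gains 0.02 accuracy, so the sketch would recommend labeling more, matching how you'd read the recipe's printed output by eye.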

You can find several support issues that help in the interpretation and use of train-curve:

There are also tips on customizing train-curve, like adding label stats and saving results:

Or modifying the evaluation metrics:

Are you creating your own evaluation dataset yourself, or are you letting Prodigy create it for you automatically?

As the previous post mentions, you may want to create a dedicated hold-out (evaluation) dataset if you haven't already. In the train docs, there's this tip:

For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.

Let me know if you have any questions on how to write a script for this.

As you may have tried, to improve specific labels, you can use active learning (textcat-teach), model-in-the-loop predictions (e.g., textcat.correct), or patterns (rules). Here are the docs for doing this with text classification.

Last, if you're working with multiple annotators, another approach to answering "How much data do I need to label?" is to consider bootstrapping for inter-rater reliability. My colleague Peter Baumgartner recently wrote an interesting blog post:

It's important to note that bootstrapping is a general concept that can be used for any statistic, but is typically computationally intensive which is the limiting factor. For example, you could "bootstrap" (sample with replacement) train-curve which would provide uncertainty estimates on accuracy. The problem is this may take a very long time to estimate.
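As a small illustration of the general idea, here's a sketch of bootstrapping a confidence interval on accuracy from a fixed evaluation set. The per-example correctness flags are toy data (41 correct out of 100, echoing the 0.41 score in the thread purely for illustration); this is the cheap version of the idea, not the expensive train-curve bootstrap described above.

```python
import random

random.seed(0)  # fixed seed so the resamples are reproducible

# Toy per-example correctness flags (1 = model got it right), standing in
# for a real evaluation set. 41/100 correct is illustrative only.
results = [1] * 41 + [0] * 59

def bootstrap_accuracy(results, n_resamples=1000):
    """Resample the eval set with replacement and collect accuracies."""
    accs = []
    for _ in range(n_resamples):
        sample = [random.choice(results) for _ in results]
        accs.append(sum(sample) / len(sample))
    accs.sort()
    # 2.5th and 97.5th percentiles -> rough 95% confidence interval
    return accs[int(0.025 * n_resamples)], accs[int(0.975 * n_resamples)]

low, high = bootstrap_accuracy(results)
print(f"accuracy ~0.41, 95% CI roughly ({low:.2f}, {high:.2f})")
```

With only 100 evaluation examples the interval is wide, which is itself useful information: it tells you how much of a score change between experiments could be pure noise.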

It seems that the model is improving at the last segment when we run train-curve. So we will try first to add more annotated examples to the dataset.

We are using the --eval-split parameter (0.2) when training. Does it matter if you use a --base-model during training?

You mentioned about "rethinking your annotation scheme (e.g., change your class definitions)". Can you give an example of this?


If you're using --eval-split, you're not creating a dedicated hold-out dataset. That behavior is independent of whether you use a --base-model.

When you use --eval-split, you're allowing Prodigy to take your dataset and randomly split it. This is okay early on, but if you run multiple experiments, you won't have a fixed hold-out dataset: each run draws a new 20% holdout. This can cause weird train-curve results, where one run shows an increasing curve while a second run (with exactly the same annotated data) shows a different curve, simply because the holdout changed.
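The effect is easy to see with a toy simulation. This sketch mimics an --eval-split style split on 100 made-up example IDs (it doesn't use Prodigy at all): two "runs" with different random states draw different 20% holdouts, so their evaluation scores aren't comparable.

```python
import random

# Toy stand-in for an annotated dataset: 100 example IDs.
examples = list(range(100))

def random_holdout(seed, eval_split=0.2):
    """Mimic an --eval-split style split: a fresh random 20% per run."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    return set(shuffled[:n_eval])

# Two "runs" draw different holdouts, so scores aren't comparable.
print(sorted(random_holdout(seed=1))[:5])
print(sorted(random_holdout(seed=2))[:5])
```

A dedicated evaluation dataset avoids exactly this: every experiment is scored against the same fixed examples.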

There are two ways you can do this. The easiest is to label a new set of data and put it into a Prodigy dataset (let's call it eval_data). Then when you run Prodigy, you can use the eval: prefix to set your evaluation dataset.

python -m prodigy train-curve --textcat train_data,eval:eval_data

If you run this, Prodigy won't randomly create your evaluation dataset. Instead, it'll always use eval_data. This is why we call it a "dedicated" evaluation set: it's reserved for evaluation only and never used for training.

An alternative would be to partition your existing training data (say dataset) and take a random x% into a new Prodigy dataset. I've written a small snippet of code to do this:

from prodigy.components.db import connect
import random

# set a seed for reproducibility
random.seed(42)

db = connect()

# change "dataset" to the name of your annotated dataset
examples = db.get_dataset("dataset")

# shuffle once, then split, so each example lands in exactly one set
eval_split = 0.2
random.shuffle(examples)
n_eval = int(len(examples) * eval_split)
eval_data = examples[:n_eval]
train_data = examples[n_eval:]

db.add_examples(eval_data, ["eval_dataset"])
db.add_examples(train_data, ["train_dataset"])

This assumes your existing annotated data is named dataset in Prodigy. The code partitions your data and creates two new datasets: train_dataset and eval_dataset.

Now you can run:

python -m prodigy train-curve --textcat train_dataset,eval:eval_dataset

Annotation schemes are explicit definitions of what you're trying to categorize. These are particularly important when you have multiple annotators to ensure that all annotators have a consistent mental model of what they're annotating. It can also be helpful even if you're the only one annotating as it can help you be explicit with what definition you're using in your annotations.

I would highly recommend Matt's 2018 PyData Berlin talk. I've posted the video to start at the middle of the talk, where he has the example of framing an NER task so that his annotation scheme is aligned with his business problem.

He discusses how changing the definition of his entity scheme (in his example, from crime_location to location) helped improve his model's performance. This is a perfect case of how a different definition of the scheme (i.e., the entity type) can lead to a better model.

We've recently published a case study on how the Guardian turned their annotation schemes into annotation guidelines when using Prodigy for quote extraction, which helped ensure their annotators had clear definitions of what they were annotating.

The key message is: if annotators don't have a clear, well-defined definition of what they're labeling (e.g., categories in textcat or entity types in ner), your model very likely won't train as well as it would with consistent annotations. Moreover, as Matt highlights, it's important to frame your categories in a way the model can learn more naturally. This is why, when experimenting, you may find that what breaks through an accuracy plateau isn't more data or more powerful algorithms: it's rethinking your annotation scheme.

Hope this helps!