How much training data for multiclass/multilabel text classification?

If you're using --eval-split, you're not creating a dedicated held-out dataset, and the split has nothing to do with --base-model.

When you use --eval-split, you're allowing Prodigy to take your dataset and split it randomly. That's okay early on, but if you run multiple experiments you won't have a fixed held-out dataset: each time you run, you get a new 20% holdout. This can produce odd results with train-curve, where one run shows an increasing curve while a second run (on the exact same annotated data) shows a different curve, simply because the holdout is different.
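For reference, this is roughly what a run with a random split looks like (the dataset name train_data and the 0.2 split are just placeholders, and the exact flags may vary slightly by Prodigy version):

python -m prodigy train-curve --textcat train_data --eval-split 0.2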

There are two ways you can do this. The easiest is to label a new set of data and put it into its own Prodigy dataset (let's call it eval_data). Then when you run Prodigy, you can use the eval: prefix to point at your evaluation dataset.
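For example, you could collect that evaluation set with textcat.manual (the input file news_headlines.jsonl and the labels here are just placeholders for your own data and label scheme):

python -m prodigy textcat.manual eval_data ./news_headlines.jsonl --label LABEL_A,LABEL_B

Once that dataset exists, you can pass it with the eval: prefix when training: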

python -m prodigy train-curve --textcat train_data,eval:eval_data

If you run this, Prodigy won't randomly create your evaluation dataset. Instead, it'll always use eval_data. This is why we call it a "dedicated" evaluation set: it's reserved for evaluation only and never used for training.

An alternative would be to partition your existing training data (say it's called dataset) and move a random x% into a new Prodigy dataset. I've written a small snippet of code to do this:

from prodigy.components.db import connect
import random

# set a seed for reproducibility
random.seed(123)

db = connect()

# change "dataset" to the name of your existing annotated dataset
examples = db.get_dataset("dataset")
eval_split = 0.2

# shuffle once and slice, so every example lands in exactly one partition
random.shuffle(examples)
n_eval = int(len(examples) * eval_split)
eval_data = examples[:n_eval]
train_data = examples[n_eval:]

db.add_dataset("eval_dataset")
db.add_examples(eval_data, ["eval_dataset"])

db.add_dataset("train_dataset")
db.add_examples(train_data, ["train_dataset"])

This assumes your existing annotated data is stored in Prodigy under the name dataset. The code partitions your data and creates two new datasets: train_dataset and eval_dataset.
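If you want a quick sanity check before training, you can print the sizes of the new datasets using the same database connection as above:

# the two partitions should add up to your original dataset
print("train:", len(db.get_dataset("train_dataset")))
print("eval:", len(db.get_dataset("eval_dataset")))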

Now you can run:

python -m prodigy train-curve --textcat train_dataset,eval:eval_dataset

Annotation schemes are explicit definitions of what you're trying to categorize. They're particularly important when you have multiple annotators, to ensure everyone shares a consistent mental model of what they're annotating. They're also helpful even if you're the only annotator, because they force you to be explicit about the definitions you're using.

I would highly recommend Matt's 2018 PyData Berlin talk. I've linked the video to start in the middle of the talk, where he gives an example of framing a NER task so that his annotation scheme is aligned with his business problem.

He discusses how changing the definition of his entity scheme (in his example, from crime_location to location) helped improve his model's performance. It's a perfect case of how a different definition of the scheme (i.e., the entity types) can lead to a better model.

We also recently published a case study on how the Guardian turned their annotation schemes into annotation guidelines when using Prodigy, which gave their annotators clear definitions of what they were labeling for quote extraction.

The key message is: if annotators don't have a clear, well-defined definition of what they're labeling (e.g., the categories in textcat or the entity types in ner), your model very likely won't train as well as it would with consistent annotations. Moreover, as Matt highlights, it's important to frame your categories in a way the model can learn more easily. This is why, when experimenting, you may find that if you hit a plateau in accuracy, what breaks you through isn't more data or more powerful algorithms: it's rethinking your annotation scheme.
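To make that concrete, here's a minimal sketch of what rethinking a textcat scheme can look like in code. The label names and the remapped_dataset name are purely hypothetical; the idea is just to collapse fine-grained labels into coarser ones and save the result as a new Prodigy dataset you can retrain on:

from prodigy.components.db import connect

# hypothetical mapping: collapse fine-grained categories into a coarser scheme
LABEL_MAP = {"FRAUD_COMPLAINT": "COMPLAINT", "BILLING_COMPLAINT": "COMPLAINT"}

db = connect()
examples = db.get_dataset("dataset")

for eg in examples:
    # multi-label textcat (e.g. from textcat.manual) stores selections in "accept"
    if "accept" in eg:
        eg["accept"] = sorted({LABEL_MAP.get(label, label) for label in eg["accept"]})
    # binary textcat annotations store a single "label"
    if "label" in eg:
        eg["label"] = LABEL_MAP.get(eg["label"], eg["label"])
    # if your examples also carry "options", remap those the same way

db.add_dataset("remapped_dataset")
db.add_examples(examples, ["remapped_dataset"])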

Hope this helps!