How to do multiclass textcat?

Prodigy supports annotating multiple classes or labels at once, so you can do something like:

prodigy textcat.teach my_dataset en_core_web_sm my_source.jsonl --label POLITICS,ECONOMY

You can always keep adding more examples of different labels to the same dataset. When you use the textcat.batch-train command, Prodigy will read all the classes present in your dataset and train on them.
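To make the "read all classes from the dataset" idea a bit more concrete, here's a rough sketch. It is not Prodigy's actual implementation. It assumes each annotation is stored as a dict with "text", "label" and "answer" keys; the sample records and the `collect_labels` helper are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotations, shaped like binary textcat decisions:
# one record per (text, label) pair, with the annotator's answer.
examples = [
    {"text": "Parliament votes on the new budget", "label": "POLITICS", "answer": "accept"},
    {"text": "Parliament votes on the new budget", "label": "ECONOMY", "answer": "accept"},
    {"text": "Stocks fell sharply on Monday", "label": "POLITICS", "answer": "reject"},
    {"text": "Stocks fell sharply on Monday", "label": "ECONOMY", "answer": "accept"},
    {"text": "Some unclear text", "label": "ECONOMY", "answer": "ignore"},
]


def collect_labels(examples):
    """Count accept/reject decisions per label, skipping ignored examples."""
    counts = defaultdict(Counter)
    for eg in examples:
        if eg.get("answer") in ("accept", "reject"):
            counts[eg["label"]][eg["answer"]] += 1
    return {label: dict(counter) for label, counter in counts.items()}


label_counts = collect_labels(examples)
print(sorted(label_counts))
```

A quick count like this is also a handy sanity check before training: if one label has very few accepted examples, it probably needs more annotation first.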

When using Prodigy for text classification, there’s no explicit need for the spaCy model to know the classes beforehand. Depending on the data you’re working with and the classes you want to annotate, it might make sense to start off with a terminology list, which you can bootstrap using the terms.teach recipe. The list could either cover all classes, or you could create one for each class (depending on the data and how fine-grained the categories are). If you haven’t seen it yet, check out the end-to-end example of training an insults classifier with Prodigy. The example only covers two classes (“insult” and “not insult”), but the same approach should work for a multi-class task as well.
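If you go with one terminology list per class, a small script can turn those lists into labelled match patterns. The sketch below assumes the JSONL pattern format with a "label" and a token-based "pattern" per line, and that your version of the recipe accepts a patterns file. The terms and the `terms_to_patterns` helper are invented for the example, and splitting on whitespace is a simplification of real tokenization.

```python
import json

# Hypothetical seed terms, one list per class.
terms_by_label = {
    "POLITICS": ["election", "prime minister"],
    "ECONOMY": ["inflation"],
}


def terms_to_patterns(terms_by_label):
    """Build one case-insensitive token pattern per term, tagged with its label."""
    patterns = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # Naive whitespace split: one {"lower": ...} entry per word.
            token_pattern = [{"lower": tok} for tok in term.lower().split()]
            patterns.append({"label": label, "pattern": token_pattern})
    return patterns


patterns = terms_to_patterns(terms_by_label)

# One JSON object per line, ready to be written to a .jsonl file.
jsonl = "\n".join(json.dumps(p) for p in patterns)
print(jsonl)
```

Keeping the terms grouped by label like this makes it easy to add or drop a class later without touching the others.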

Ultimately, it all comes down to experimenting with what works best on your data – and Prodigy can hopefully help with that :blush:

Btw, a quick note on the annotation strategy: To make the most of the binary annotation UI, we generally recommend not annotating too many classes at once, especially if they’re very different content-wise. Moving through the examples quickly works best if you (or the annotator) can focus on one objective at a time and don’t have to spend much time reading and analysing the annotation task. For example, if you’re annotating whether a text is about food or about cars, switching between those objectives on each decision can make annotation less effective, so it might be better to annotate both classes separately. (This is mostly a UX psychology consideration, though.)
