So, you'd first need to create an initial model. Start by making sure your annotations are in the right format for spaCy or Prodigy.
Here's a post where we go through the data format:
You'll need to decide whether you want your classes to be mutually exclusive (`textcat`) or not (`textcat_multilabel`), which can affect how the data is formatted. See that post for more details.
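As a quick sketch (the dataset name, file name, and labels below are just placeholders), this choice also shows up at annotation time: `textcat.manual` treats labels as non-exclusive by default, and you can pass `--exclusive` if only one label can apply per example:

```bash
# Non-exclusive (multilabel) annotation: several labels can be accepted per text
prodigy textcat.manual my_dataset ./news.jsonl --label SPORTS,POLITICS,TECH

# Mutually exclusive annotation: exactly one label per text
prodigy textcat.manual my_dataset ./news.jsonl --label SPORTS,POLITICS,TECH --exclusive
```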
Then, once your annotations are in the Prodigy DB, use `data-to-spacy` to export the annotations and train with `spacy train`. You can train with `prodigy train` instead -- it has the advantage of being quicker to start, but it's harder to reconfigure down the road since it hides the config file, so I tend to recommend learning `data-to-spacy`/`spacy train` early on.
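As a rough sketch (dataset name, output paths, and model directory are placeholders), the export-then-train workflow looks something like this:

```bash
# Export annotations from the Prodigy DB into a spaCy corpus + config
prodigy data-to-spacy ./corpus --textcat my_textcat_dataset

# Train with spaCy directly, using the exported config and .spacy files
python -m spacy train ./corpus/config.cfg \
  --paths.train ./corpus/train.spacy \
  --paths.dev ./corpus/dev.spacy \
  --output ./model

# Or the quicker-to-start alternative (Prodigy manages the config for you)
prodigy train ./model --textcat my_textcat_dataset
```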
Also, you may want to create a dedicated holdout (evaluation) dataset very early on. This will make your experiments down the road much easier to read, since your evaluation dataset stays the same. If you don't specify a dedicated holdout dataset, Prodigy will create a random partition for evaluation. However, this partition can change each time, so if you rerun, you may get different results simply due to a new holdout (evaluation) dataset. Be sure to use the `eval:` prefix with either `data-to-spacy` or `prodigy train`, e.g., `--textcat dataset,eval:eval_dataset`.
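Concretely (again with placeholder dataset names), that looks like:

```bash
# "my_eval_dataset" is used only for evaluation; "my_textcat_dataset" for training
prodigy data-to-spacy ./corpus --textcat my_textcat_dataset,eval:my_eval_dataset

# Same idea with prodigy train
prodigy train ./model --textcat my_textcat_dataset,eval:my_eval_dataset
```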
Sort of. Here's a post (see the slides, which cover NER, but the same idea applies to `textcat`) that provides some detail. Essentially, updating is done for the known (binary) labels and the other labels are treated as missing.
What's worth mentioning is that this approach was designed for a reasonable number of labels; when working with 100+ labels, I'm a little more skeptical about how well it would work (especially if you don't already have a well-trained model before running `textcat.teach`). The problem is that `textcat.teach` assumes you have a model that can measure uncertainty well, that is, one that "knows" what it doesn't know. If you only have a small amount of data across all labels, and perhaps an imbalance for many of them, it's hard for `textcat.teach` to work well, because the model doesn't know what it doesn't know (i.e., it can't measure uncertainty well).
I would recommend a "bottom-up" approach, where you start with good/well-balanced labels, and then only add imbalanced/rare/poor-performing labels slowly:
- Before applying this to your entire label set (100+ labels), start with a small subset of labels (6-8) that you know have a good number of annotations. Train an initial model on only those labels. This will give you a solid benchmark model. If one or two of those labels aren't performing as well as you'd like, you can add more annotations for them and retrain a new model from scratch.
- Then, slowly expand by adding more labels in small groups. You'll first need to add them into your training, then use prior knowledge to focus on the labels that need the most help -- for example, ones that are severely imbalanced or where model performance is poor.
- Consider using `textcat.correct` instead of `textcat.teach` early on. `textcat.correct` will still show the model's predictions in the UI (which makes things a little easier, as your job is just to correct them). You can even pass a `--threshold` parameter so that only labels scoring above that threshold are pre-selected.
- Only use `textcat.teach` once you have a sufficient number of examples for that label. Also consider using patterns (see the docs) in combination. There's a sketch of both recipes after this list.
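Here's a rough sketch of what those recipes look like on the command line (the model path, dataset, source file, labels, and patterns file are all placeholders):

```bash
# Early on: correct the model's predictions; labels scoring above 0.5 are pre-selected
prodigy textcat.correct my_dataset ./model ./unlabeled.jsonl \
  --label SPORTS,POLITICS,TECH --threshold 0.5

# Later, for a label with enough examples: active learning, optionally seeded with patterns
prodigy textcat.teach my_dataset ./model ./unlabeled.jsonl \
  --label RARE_LABEL --patterns ./rare_label_patterns.jsonl
```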
Hope this helps!