It's quite possible that some of our default settings are suboptimal for large-class problems, because the architectures in Prodigy were mostly optimised for datasets with fewer categories. That said, I'd usually expect to match the performance of the FastText textcat models, because our model architecture should be able to extract the same information they're extracting.
I think the most likely problem here is that the model isn't being set up to predict mutually exclusive classes, which is why you've had to generate those negative examples. If you're training the model with FastText, I'm assuming your data is such that only one label is correct per example. So the important thing is to make sure spaCy is set up with that knowledge.
I would go ahead and use spaCy directly, rather than using Prodigy's `textcat.batch-train`, simply so that you have one less layer of software. It also means you'll be training the model with open-source tooling, which is always going to be preferable to having your automation depend on a proprietary tool (even when the proprietary tool is ours -- I couldn't give the opposite advice with a straight face).
You should be able to use the example script here: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
Two things are important:

- Make sure you're passing the `"exclusive_classes": True` setting.
- Make sure you're setting up your `cats` dict so that one label is 1.0, and all of the other labels are provided as 0.0. In other words, you need a "dense" format, with no missing values.
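As a concrete illustration of the second point (the helper name and labels here are made up for the example, not from your data), building the dense `cats` dict for a single gold label might look like:

```python
def make_cats(gold_label, all_labels):
    """Build a dense "cats" dict: the gold label gets 1.0, and every
    other label is explicitly present with 0.0 -- no missing values."""
    return {label: 1.0 if label == gold_label else 0.0
            for label in all_labels}

labels = ["BILLING", "SHIPPING", "RETURNS"]
print(make_cats("SHIPPING", labels))
# {'BILLING': 0.0, 'SHIPPING': 1.0, 'RETURNS': 0.0}
```

Since only one label is correct per example in your data, every `cats` dict should have exactly one 1.0 entry.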
@adriane has been working on the usability of the textcat component to make these things easier, and to make sure the process is less error-prone. But you should already be able to see useful results quite quickly.
One outcome of the experiments Adriane has been running is that the `"architecture": "bow"` setting often performs very well. I would be sure to try that out first, especially in your early experiments while you're trying to get things set up correctly. It will run far faster than `"simple_cnn"`, which should speed up your process of getting everything correct.
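To put the two settings side by side, here's a sketch of the config dicts you'd pass when creating the textcat pipe (as in the spaCy v2 API the example script above uses, e.g. `nlp.create_pipe("textcat", config=...)`; adjust if you're on a different version):

```python
# Fast bag-of-words model: a good first choice while you're still
# debugging the data setup, since it trains far faster.
bow_config = {"exclusive_classes": True, "architecture": "bow"}

# The slower CNN model, worth trying once everything trains correctly.
cnn_config = {"exclusive_classes": True, "architecture": "simple_cnn"}
```

The key point is that `"exclusive_classes": True` stays set in both cases; only the architecture changes between experiments.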