I’ve read through the readme and as many support topics as possible, but haven’t found anything that helps me with my particular problem.
tldr:
Training a multiclass textcat with imbalanced classes (but equal accept and reject examples) leads to predictions that always choose the classes with the most training examples.
Background:
I’m working on a proof-of-concept classifier. Using the text descriptions of technology products I’m trying to classify the products into predefined technology categories. There are over 1400 categories total, but currently, many categories only have a few products. When I implement a lower cutoff of at least 6 examples per category I’m left with 850 categories.
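For reference, the lower cutoff can be sketched roughly as follows. This is only an illustration: the `(text, category)` tuple layout, the function name `filter_by_min_examples`, and the toy data are assumptions, not my actual preprocessing code.

```python
from collections import Counter

MIN_EXAMPLES = 6  # the lower cutoff described above


def filter_by_min_examples(examples, min_examples=MIN_EXAMPLES):
    """Keep only examples whose category has at least `min_examples` items."""
    counts = Counter(category for _, category in examples)
    return [(text, category) for text, category in examples
            if counts[category] >= min_examples]


# Toy data, made up for illustration: one category with 6 products, one with 1.
examples = [("router description %d" % i, "routers") for i in range(6)]
examples.append(("a lone modem description", "modems"))

kept = filter_by_min_examples(examples)
kept_categories = sorted({category for _, category in kept})
print(len(kept), kept_categories)
```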
I have pre-formatted the dataset into JSONL with text, label, and answer fields. Since classes are mutually exclusive, I augmented each class with an equal number of “reject” examples by randomly selecting product descriptions from other classes. Each class has an equal number of accept and reject examples. However, classes have an imbalanced number of examples ranging from 6 up to 280.
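A minimal sketch of how the accept/reject augmentation described above could look. The helper name `build_examples`, the `by_label` dict layout, the fixed seed, and the toy descriptions are all assumptions for illustration; only the text, label, and answer fields come from my actual data.

```python
import json
import random


def build_examples(by_label, seed=0):
    """Return one accept row per description plus an equal number of rejects."""
    rng = random.Random(seed)
    rows = []
    for label, texts in by_label.items():
        # Pool of descriptions that belong to any other class.
        others = [t for other, ts in by_label.items() if other != label for t in ts]
        for text in texts:
            rows.append({"text": text, "label": label, "answer": "accept"})
        # One reject per accept, sampled without replacement from other classes.
        k = min(len(texts), len(others))
        for text in rng.sample(others, k):
            rows.append({"text": text, "label": label, "answer": "reject"})
    return rows


# Toy data, made up for illustration.
by_label = {
    "routers": ["a", "b"],
    "modems": ["c", "d"],
    "switches": ["e", "f"],
}
rows = build_examples(by_label)
jsonl = "\n".join(json.dumps(row) for row in rows)  # one JSON object per line
print(jsonl)
```

Note that this balances accept and reject within each class, but does nothing about the imbalance between classes.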
When I run the textcat.batch-train recipe without the label flag, the reported accuracy very quickly reaches 98-100%. However, when I run my holdout test set through the resulting model, the top 5 predicted categories are always the categories with the largest number of training examples.
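One way to quantify this symptom is to count how often each label lands in the top k across the holdout set and compare that with the training counts. The sketch below assumes each prediction is a dict mapping label to score (the same shape as spaCy's `doc.cats`); the function name `top_k_label_counts` and all numbers are made up for illustration.

```python
from collections import Counter


def top_k_label_counts(scores, k=5):
    """Count how often each label appears among the k highest-scoring labels."""
    counts = Counter()
    for class_scores in scores:
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        counts.update(ranked[:k])
    return counts


# Hypothetical training counts and holdout predictions (illustrative only).
train_counts = Counter({"big": 280, "mid": 40, "small": 6})
scores = [
    {"big": 0.9, "mid": 0.5, "small": 0.1},
    {"big": 0.8, "mid": 0.6, "small": 0.2},
    {"big": 0.7, "mid": 0.1, "small": 0.3},
]

predicted = top_k_label_counts(scores, k=1)
most_frequent = {label for label, _ in train_counts.most_common(1)}
# True when every top-1 prediction is one of the most frequent training classes.
collapsed = set(predicted) <= most_frequent
print(predicted, collapsed)
```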
Using a scikit-learn SVM with a simple bag-of-words, TF-IDF vectorization, and one-hot encoded labels, I’m able to achieve 77% top-1 categorical accuracy and 86% top-5 categorical accuracy.
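To be explicit about what I mean by top-1 and top-5 accuracy: a prediction counts as correct if the true label is among the k highest-scoring labels. A minimal sketch, assuming per-document score dicts; the function name `top_k_accuracy` and the toy scores are hypothetical and are not my actual results.

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of documents whose true label is among the k top-scoring labels."""
    hits = 0
    for class_scores, true_label in zip(scores, true_labels):
        top_k = sorted(class_scores, key=class_scores.get, reverse=True)[:k]
        hits += true_label in top_k
    return hits / len(true_labels)


# Toy scores for three documents over three labels (illustrative only).
scores = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.1, "b": 0.3, "c": 0.6},
]
true_labels = ["a", "b", "a"]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
print(top1, top2)
```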
While that’s pretty good, I’m fairly sure I can get better results using recent neural network models, and I’m using spaCy and Prodigy because I want to create a fully customized NLP pipeline for the particular register of English I’m working in (technology). Ideally, my resulting spaCy model will have fully customized core components, plus many additional classifiers trained on different labeling tasks. This particular classifier is just the start of that process and an attempt to establish a baseline against which to measure improvements from changes to the core components.
I started this process by trying to wrap a spaCy component around a Keras/TensorFlow text classification model that would output a one-vs-rest classification, but got lost in the nuances of the API. So I decided to switch over to the built-in textcat, since there are many more examples, questions, and answers for it.