Imbalanced classes in a multiclass textcat lead to completely biased predictions

I’ve read through the readme and as many support topics as possible, but haven’t found anything that helps me with my particular problem.

tldr:
Training a multiclass textcat with imbalanced classes (but equal accept and reject examples) leads to predictions that always choose the classes with the most training examples.

Background:
I’m working on a proof-of-concept classifier. Using the text descriptions of technology products, I’m trying to classify the products into predefined technology categories. There are over 1,400 categories in total, but currently many categories only have a few products. When I apply a cutoff of at least 6 examples per category, I’m left with 850 categories.

I have pre-formatted the dataset into JSONL with text, label, and answer fields. Since classes are mutually exclusive, I augmented each class with an equal number of “reject” examples by randomly selecting product descriptions from other classes, so each class has an equal number of accept and reject examples. However, the classes themselves are imbalanced, with anywhere from 6 up to 280 examples each.
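
For illustration, a pair of lines in my JSONL looks roughly like this (the texts and the category name are made up):

{"text": "Ultraportable 13-inch laptop with a 10-hour battery", "label": "LAPTOPS", "answer": "accept"}
{"text": "Wireless noise-cancelling over-ear headphones", "label": "LAPTOPS", "answer": "reject"}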

When I run the textcat.batch-train recipe without the label flag, the model very quickly thinks it has reached 98-100% accuracy. However, when I run my holdout test set through the resulting model, the top 5 predicted categories are always the categories with the largest number of training examples.

Using a scikit-learn SVM and a simple bag-of-words with tf-idf vectorization and one-hot encoded labels, I’m able to achieve 77% top-1 categorical accuracy and 86% top-5 categorical accuracy.

While that’s pretty good, I’m fairly sure I can get better results using recent neural network models, and I’m using spaCy and Prodigy because I want to create a fully customized NLP pipeline for the particular register of English I’m working in (technology). Ideally, my resulting spaCy model will have fully customized core components with many additional classifiers trained on different labeling tasks. This particular classifier is just the start of that process and an attempt to establish a baseline against which to measure improvements from enhancements to the core components.

I started this process by trying to wrap a spaCy component around a Keras/TensorFlow text classification model that would output a one-vs-rest classification, but got lost in the nuances of the API. So I decided to switch over to the built-in textcat, since there are many more examples, questions, and answers available for it.


I may have identified the issue here in my own code. I was producing far too many ‘reject’ examples: if a class had 10 ‘accept’ examples, I produced 100 ‘reject’ examples. Once I corrected my augmentation code, the results were much more evenly spread across the classes.

I still don’t have an intuitive understanding of why imbalanced ‘accept’ and ‘reject’ samples produce output so strongly biased towards the most populated classes, but I also haven’t dug into the guts of the textcat classifier.

One of the things that’s currently missing from the textcat class is an option to enforce mutually exclusive classes. If you have a lot of classes, learning non-mutually-exclusive labels is quite difficult. I think that’s likely to be the root of the issue.

Here’s how to take control of the model construction for spaCy’s text categorizer. You just need to subclass and override the Model classmethod:


import spacy
from spacy._ml import build_text_classifier  # the default model-construction function

class CustomTextCategorizer(spacy.pipeline.TextCategorizer):
    @classmethod
    def Model(cls, nr_class=1, width=64, **cfg):
        # This needs to return a Thinc model
        return build_text_classifier(nr_class, width, **cfg)

To make spaCy default to using your custom class, you can override the setting within the Language.factories dictionary:


from spacy.language import Language
Language.factories['textcat'] = lambda nlp, **cfg: CustomTextCategorizer(nlp.vocab, **cfg)

Now when spaCy calls `nlp.create_pipe('textcat')`, you’ll get a call to your lambda function, which will return an instance of your custom text categorizer class. This step is optional – if you’re happy to instantiate your class directly, that should work fine too.
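
For example (a quick sketch, assuming spaCy v2’s create_pipe/add_pipe API and an installed model like en_core_web_sm):

import spacy

nlp = spacy.load('en_core_web_sm')
textcat = nlp.create_pipe('textcat')   # goes through the factory registered above
nlp.add_pipe(textcat, last=True)

# ...or skip the factory and instantiate the subclass directly:
textcat = CustomTextCategorizer(nlp.vocab)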

Okay, now to build the model. The default definition of the text categorizer model can be found in spacy/_ml.py, using Thinc’s concise syntactic sugar for defining models. The part to change is this block:


        model = (
            (linear_model | cnn_model)
            >> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
            >> logistic
        )

The | and >> operators are bound to concatenate() and chain() respectively. So, what we’re doing here is forming a little ensemble: we have a linear model and the CNN, and we concatenate their outputs and reweight them. This is done by feeding the concatenated output forward into the Affine layer. Finally, we squash each class score into the range (0, 1) independently, using the logistic function.
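
Written without the operator sugar, the same block looks roughly like this (a sketch, calling chain() and concatenate() directly and reusing the helpers that are already in scope inside build_text_classifier):

        # the same ensemble, spelled out with chain() and concatenate()
        model = chain(
            concatenate(linear_model, cnn_model),
            zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)),
            logistic,
        )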

To make the classes mutually exclusive, we just need to use the Softmax class instead:


        model = (
            (linear_model | cnn_model)
            >> Softmax(nr_class, nr_class*2)
        )

I hope this helps address the accuracy problems you’re seeing. I have to say though, I’m still a little bit nervous about how the model will perform against your baselines. I’ve found this text classification architecture, with these default parameters, to perform quite well on a range of text classification problems that I’ve tried it on. I’ve been particularly pleased with the results on short texts, where bag-of-words models struggle.

However, I haven’t tried it on any problems with nearly as many categories as you’re working with. Unfortunately neural networks are still fairly fiddly. To get good performance, you may have to play with various hyper-parameters, experiment with the architecture a little, etc.

If you haven’t seen it already, you might want to check out Vowpal Wabbit. It’s a well battle-tested piece of software that’s pretty much the go-to solution for terascale classification problems.

You might find that once you’ve created your training data with Prodigy, you actually get better efficiency and accuracy with another library. Obviously there’s lots of great ML software out there.


Matthew,

Thanks for the swift reply! I appreciate the in-depth look under the hood. Before starting down the path of fully customizing the underlying model, I decided to approach the project manager responsible for product tags. From that conversation, we determined that a broader “technology area” label would still be acceptable, and that a non-mutually-exclusive classification would actually be preferable to an exclusive one, provided the core technology area tag was the top (or near the top) tag.

I grouped the 800+ technology tags into 189 technology-area tags and built a smaller dataset of approximately 5000 samples, with an evaluation set of 2000 samples. I improved my “reject” augmentation script by using spaCy to calculate the similarity between documents and choosing the least similar documents to form the reject samples for a given technology-area tag. I then reran the batch-train, ran a hold-out set of examples through the newly trained model, and calculated top-10 accuracy scores.
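
The selection logic is roughly this kind of thing (a simplified sketch rather than my exact script; it assumes a model with word vectors loaded, e.g. en_core_web_lg):

import spacy

nlp = spacy.load('en_core_web_lg')  # .similarity() needs word vectors

def least_similar_rejects(accept_texts, candidate_texts, n):
    # Score each candidate by its best similarity to any accepted example,
    # then keep the n candidates with the lowest scores as 'reject' samples.
    accept_docs = list(nlp.pipe(accept_texts))
    scored = []
    for doc in nlp.pipe(candidate_texts):
        score = max(doc.similarity(a) for a in accept_docs)
        scored.append((score, doc.text))
    scored.sort()
    return [text for score, text in scored[:n]]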

Top1: 57.24331926863573
Top2: 68.35443037974683
Top3: 73.48804500703235
Top4: 76.93389592123769
Top5: 79.25457102672293
Top6: 80.73136427566807
Top7: 81.85654008438819
Top8: 82.84106891701828
Top9: 83.75527426160338
Top10: 84.59915611814345
Wrong: 15.400843881856542
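
For reference, the top-k figures above are computed along these lines (a simplified sketch, assuming each hold-out example has a single gold label):

def top_k_accuracy(nlp, examples, k_max=10):
    # examples: list of (text, gold_label) pairs from the hold-out set
    hits = [0] * k_max
    for text, gold in examples:
        doc = nlp(text)
        ranked = sorted(doc.cats, key=doc.cats.get, reverse=True)
        for k in range(k_max):
            if gold in ranked[:k + 1]:
                hits[k] += 1
    return [100.0 * h / len(examples) for h in hits]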

Better than my previous results, but still not as good as my bag-of-words model. My first thought was that I simply didn’t have enough training examples in many of the categories, so I started trying to collect more examples for my least-represented categories using textcat.teach and several of your API loaders. Unfortunately, these APIs never returned any “accept” results, despite trying with both seed terms and search queries (for the APIs that accept queries). I even tried your Reddit loader with several different months of input data, without success.

Eventually I decided to try customizing the NewsAPI loader and the textcat.teach recipe with the following code.

import prodigy
from prodigy.components.loaders import NewsAPI
from prodigy.recipes.textcat import teach
from prodigy.models.textcat import TextClassifier
from prodigy.components.sorters import prefer_uncertain
import plac
import spacy

@prodigy.recipe('news_body', dataset=prodigy.recipe_args['dataset'], 
                spacy_model=prodigy.recipe_args['spacy_model'], 
                query=plac.Annotation("query", 'option', 'q', str),
                label=prodigy.recipe_args['entity_label'])
def news_body(dataset, spacy_model, query,  label):
    stream = NewsAPI(query=query, key='mykey', 
                     sources=['techcrunch', 'the-verge', 'recode', 'bloomberg'], content_type='description')
    nlp = spacy.load(spacy_model)
    label = ','.join(label)
    model = TextClassifier(nlp, label.split(','), long_text=False)
    stream = prefer_uncertain(model(stream))
    #components = teach(dataset=dataset, spacy_model=spacy_model, source=stream, label=','.join(label))
    #return components
    return {
        'view_id': 'classification',
        'dataset': dataset,
        'stream': stream,
        'update': model.update,
        'config': {'lang': nlp.lang, 'labels': model.labels}
    }

@prodigy.recipe('fetch_news', dataset=('ID'), query=('query'))
def fetch_news(dataset, query):
    return {
        'dataset': dataset,
        'stream': NewsAPI(query=query, key='mykey',
                     sources=['techcrunch', 'the-verge', 'recode', 'bloomberg'], content_type='description')
    }

However, while I can get the fetch_news recipe to display results properly, the news_body recipe (using either the teach approach or the raw TextClassifier approach) just displays a continual “Loading…” screen.

Any idea what may be happening here with my custom recipe? Alternatively, are there other basic approaches I’m overlooking that I can take to improve the top 10 accuracy?

Stepping back a moment, I think it’s probably worth thinking about the model and problem definition before diving in and doing more annotation.

If you have 800 categories, then you’re probably going to have to sift through a lot of text to find enough examples to annotate. Finding a stream of text that’s of the right type is going to be pretty important. I often find that working with a specific subreddit is helpful. Crawling pages from a Google search query can also be good. If you could search for pages which link into Amazon technology products, that might also be a good place to start.

There are also some alternatives you could consider instead of having a flat list of categories. If the categories naturally come as a tree, you can classify into nodes of the tree, e.g. you have one model predict the first branching, and then more models for each subsequent decision. The models near the root get to have more training data, making the learning problem easier. A similar (and often better) approach is to reorganise into a smaller set of tags, and then have the tags define the categories. For instance, you can have a decision about whether the product is consumer oriented or professional, low end or high end, etc.
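
To make the tree idea concrete, prediction could look something like this (just a sketch; the two-level split and the model names are made up):

def predict_category(text, root_nlp, child_nlps):
    # root_nlp: a textcat model over the top-level branches
    # child_nlps: dict mapping each branch name to a textcat model over its leaves
    doc = root_nlp(text)
    branch = max(doc.cats, key=doc.cats.get)
    leaf_doc = child_nlps[branch](text)
    leaf = max(leaf_doc.cats, key=leaf_doc.cats.get)
    return branch, leaf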

Often a large category set is inherited from something else, e.g. someone’s product catalogue. If so, there’s often a way to acquire training data semi-automatically. Sometimes you need to pivot through some other type of annotation. For instance, you might have data that’s labelled for the existence of some product through hyperlinks. If you then annotate the hyperlink targets, you can project tags from the linked page onto the linking text.

One of the fundamental things that makes NLP hard is basically that “rare things are rare”. If you’re looking for references to digital cameras in text, then you’re really looking for a needle in a haystack. Even just streaming posts from /r/technology probably doesn’t bias the sample enough towards the category you need. If you never even see your seed terms in the text, there’s an information retrieval problem to be solved before the ML can begin.

Finally, about the model: if your bag-of-words model is doing better than spaCy’s CNN, you should definitely use the bag-of-words. As I said, one model can’t be good at every shape of problem, and I didn’t have this many classes in mind when I designed the CNN. If you do want to keep using the CNN, you might try setting low_data=True in the text classifier. This mostly uses a bag-of-words model, so it might do better. You could also try increasing the width. The default width of 64 is good for many problems, but in your case you’re trying to predict a vector of 800 class probabilities from a hidden vector of length 64. You could try 128 or 300 and see if that helps.
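
If you’re building the pipe yourself, something like this is one way to pass those settings through (a sketch; I’m assuming the config you give to create_pipe gets forwarded to build_text_classifier when the model is constructed):

import spacy

nlp = spacy.load('en_core_web_sm')
# Hypothetical settings: a wider hidden layer plus the bag-of-words-heavy
# 'low_data' architecture. The config should be forwarded to the Model
# classmethod when the component builds its model in begin_training().
textcat = nlp.create_pipe('textcat', config={'width': 128, 'low_data': True})
nlp.add_pipe(textcat, last=True)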

Matthew,

I really appreciate you sharing your NLP expertise. The factors you’ve mentioned have occurred to me and are on my radar, but your specific insight has helped me think about these issues in a new way.

The reason I haven’t tried to address these data and problem-definition issues in a wholesale manner is that I’m trying to use spaCy and Prodigy to quickly build out a proof of concept, and I’m doing it on my own time. I feel that my company could benefit from advanced NLP pipelines and models (which I’m relatively proficient at building), but it’s not my core responsibility, and without a decent working solution it’s hard to convince leadership to allocate my time to these specific problems. I believe spaCy and Prodigy can dramatically reduce the amount of time it takes me to build out a concept, which is why I’ve invested my time and money in learning them. However, the documentation is still a bit raw and the examples a bit sparse, and despite reading everything available and studying all the non-compiled code, the next steps just weren’t as obvious to me as they would be with tools I’m more familiar with.

I’m looking forward to those wrappers for tools like TensorFlow, but in the meantime, I appreciate the tip to tinker with the CNN architecture. Time to start digging into the Thinc documentation.

Which… isn’t currently available :frowning_face:. I take your point, and wish I had more ready solutions for you.


Just noticed that. Oh well. You and Ines have been cranking out an amazing amount of content, and I truly appreciate both of your work. I’m hoping to help out a little: I’m working on a corpus loader for the StackExchange data dump, as well as a series of presentations for the Big Mountain Dev & Data Conference on how to customize an NLP pipeline with spaCy and Prodigy.
