Imbalanced classes in a multiclass textcat lead to completely biased predictions

One of the things that’s missing from the textcat class currently is an option to enforce mutually exclusive classes. If you have a lot of classes, learning non-mutually-exclusive labels is quite difficult. I think that’s likely to be the root of the issue.

Here’s how to take control of the model construction for spaCy’s text categorizer. You just need to subclass it and override the Model classmethod:


import spacy.pipeline

class CustomTextCategorizer(spacy.pipeline.TextCategorizer):
    @classmethod
    def Model(cls, nr_class=1, width=64, **cfg):
        # This needs to return a Thinc model: your own copy of
        # build_text_classifier from spacy/_ml.py, modified as described below.
        return build_text_classifier(nr_class, width, **cfg)

To make spaCy default to using your custom class, you can override the setting within the Language.factories dictionary:


from spacy.language import Language

Language.factories['textcat'] = lambda nlp, **cfg: CustomTextCategorizer(nlp.vocab, **cfg)

Now when spaCy calls `nlp.create_pipe('textcat')`, you’ll get a call to your lambda function, which will return an instance of your custom text categorizer class. This step is optional – if you’re happy to instantiate your class directly, that should work fine too.
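
To make that concrete, here’s a rough sketch of how the two routes might look in practice. It assumes the CustomTextCategorizer class and the factory override from above are already defined in your script, and I haven’t run it against your spaCy version, so treat it as a starting point:

import spacy

nlp = spacy.blank('en')

# Route 1: go through the factory, so create_pipe returns the subclass
textcat = nlp.create_pipe('textcat')

# Route 2 (equivalent): skip the factory and instantiate the subclass directly
# textcat = CustomTextCategorizer(nlp.vocab)

nlp.add_pipe(textcat, last=True)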

Okay, now to build the model. The default definition of the text categorizer model can be found in spacy/_ml.py, using Thinc’s concise syntactic sugar for defining models. The part to change is this block:


        model = (
            (linear_model | cnn_model)
            >> zero_init(Affine(nr_class, nr_class*2, drop_factor=0.0))
            >> logistic
        )

The | and >> are bound to concatenate() and chain() respectively. So, what we’re doing here is forming a little ensemble: we have a linear model and the CNN, and we concatenate their outputs and reweight them. This is done by feeding the concatenated output forward into the Affine layer. Finally, we squash each class score into the range (0, 1) independently, using the logistic function.
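
For context, the operator binding happens via Thinc’s Model.define_operators, which is how spacy/_ml.py sets this up. Here’s a minimal sketch of the pattern; the import paths are from Thinc 6.x and may differ in your version, and the two Affine layers below are just stand-ins for the real linear and CNN sub-models:

from thinc.api import chain, concatenate
from thinc.v2v import Model, Affine

nr_class = 5
width = 64

# Stand-ins for the linear and CNN sub-models that build_text_classifier
# constructs; each one outputs nr_class scores.
linear_model = Affine(nr_class, width)
cnn_model = Affine(nr_class, width)

with Model.define_operators({'>>': chain, '|': concatenate}):
    # `|` concatenates the two outputs (nr_class * 2 features in total),
    # and `>>` chains that into the re-weighting Affine layer.
    model = (
        (linear_model | cnn_model)
        >> Affine(nr_class, nr_class * 2)
    )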

To make the classes mutually exclusive, we just need to use the Softmax class instead:


        model = (
            (linear_model | cnn_model)
            >> Softmax(nr_class, nr_class*2)
        )
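
Once that’s wired up, training works the same way as with the stock text categorizer; the difference is that the scores you get back in doc.cats come from a softmax, so they’ll sum to (approximately) 1 and you can simply take the highest-scoring label. Here’s a rough sketch of a training loop, with made-up labels and texts purely for illustration:

import random
import spacy

nlp = spacy.blank('en')
textcat = nlp.create_pipe('textcat')    # returns CustomTextCategorizer via the factory
nlp.add_pipe(textcat, last=True)

for label in ('BILLING', 'SHIPPING', 'RETURNS'):
    textcat.add_label(label)

train_data = [
    (u'Where is my package?', {'cats': {'BILLING': 0.0, 'SHIPPING': 1.0, 'RETURNS': 0.0}}),
    (u'I was charged twice', {'cats': {'BILLING': 1.0, 'SHIPPING': 0.0, 'RETURNS': 0.0}}),
]

optimizer = nlp.begin_training()
for i in range(10):
    random.shuffle(train_data)
    losses = {}
    texts, annotations = zip(*train_data)
    nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)

doc = nlp(u'My refund never arrived')
print(doc.cats)    # with Softmax, these scores should sum to roughly 1.0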

I hope this helps address the accuracy problems you’re seeing. I have to say though, I’m still a little bit nervous about how the model will perform against your baselines. I’ve found this text classification architecture, with these default parameters, to perform quite well on a range of text classification problems that I’ve tried it on. I’ve been particularly pleased with the results on short texts, where bag-of-words models struggle.

However, I haven’t tried it on any problems with nearly as many categories as you’re working with. Unfortunately, neural networks are still fairly fiddly. To get good performance, you may have to play with various hyper-parameters, experiment with the architecture a little, etc.

If you haven’t seen it already, you might want to check out Vowpal Wabbit. It’s a well battle-tested piece of software that’s pretty much the go-to solution for terascale classification problems.

You might find that once you’ve created your training data with Prodigy, you actually get better efficiency and accuracy with another library. Obviously there’s lots of great ML software out there.
