Help needed to get started with text classification

Okay so, there’s definitely a problem with your code: you’re not minibatching the inputs; you’re calling nlp.update on the whole dataset at once. The nlp.update method performs a single gradient-descent step, so you’re only making 200 updates, with each update estimated on the whole dataset. You need to change this to something like:

from spacy.util import minibatch

for i in range(10):
    losses = {}
    # Rebuild the (text, annotation) pairs each epoch; zip() is an
    # iterator in Python 3, so it can only be consumed once.
    annotations = [get_cats(label, labelset) for label in train_labels]
    dataset = zip(train_texts, annotations)
    # Update on small batches rather than the whole dataset at once.
    for batch in minibatch(dataset, size=8):
        batch_texts, batch_annots = zip(*batch)
        nlp.update(batch_texts, batch_annots, sgd=optimizer, drop=0.2,
                   losses=losses)

A more general point as well: you should benchmark against a bag-of-words model, probably using something like scikit-learn. I’ve been wondering what we can do to make this kind of comparison more transparent.
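
For example, a minimal baseline along those lines might look like this. It’s only a sketch: it assumes the train_texts and train_labels from above, plus a hypothetical held-out dev_texts/dev_labels split for scoring.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Bag-of-words baseline: unigrams and bigrams, tf-idf weighted,
# with English stop-words removed.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf", LogisticRegression()),
])
baseline.fit(train_texts, train_labels)
# dev_texts/dev_labels are a hypothetical held-out split.
print("baseline accuracy:", baseline.score(dev_texts, dev_labels))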

Let’s say the neural net model gets you 45%. Then you run a bigram bag-of-words model, and come up with 72% using scikit-learn, with the normal stop-words removal, tf-idf weighting, etc. So you run more hyper-parameter search, and then the neural net gets 73%. More hyper-parameter tuning on the scikit-learn model gets you 78%, while the best you can come up with from spaCy is 77.8%.
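
As a sketch of what that kind of search might look like on the scikit-learn side (the grid values here are purely illustrative, not recommendations):

from sklearn.model_selection import GridSearchCV

# Illustrative search over the baseline pipeline sketched above.
params = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(baseline, params, cv=5)
search.fit(train_texts, train_labels)
print(search.best_params_, search.best_score_)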

If the bigram model instead got you 40%, you’d probably end the hyper-parameter search for the neural network sooner, as you’d rightly conclude the problem’s probably hard and it’ll be tough to do much better. My point here is that one of the problems with hyper-parameter search is not knowing what you “ought” to be getting, which makes it hard to know whether you’re in roughly the right region of the hyper-parameter space.

If you know you’re 30% behind where you could be, that really changes what sort of decisions you explore. More epochs won’t close a 30% gap. Changing the batch size or the learning rate might, though (see the sketch below). So, life is a lot easier if you have a comparison point. Running some experiments with a simpler model is really good for framing the problem.
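
To make that concrete, here’s a sketch of those two knobs in the spaCy loop above. It assumes spaCy v2; whether the optimizer’s learning rate is exposed as alpha depends on your thinc version, so treat that attribute name as an assumption to verify.

from spacy.util import compounding

# Instead of a fixed size=8, pass a compounding schedule to minibatch,
# so batch sizes grow from 4 towards 32 over training.
batch_sizes = compounding(4.0, 32.0, 1.001)

# Tweak the learning rate on the optimizer from nlp.begin_training().
# NOTE: "alpha" is an assumption about the thinc optimizer's attribute.
optimizer.alpha = 0.001

Then, inside the loop, use minibatch(dataset, size=batch_sizes) in place of size=8.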