Help needed to get started with text classification

This is our first time using prodigy, so there are probably some stupid questions below...

We want to train a model to classify texts into multiple (around 10) issue categories. We have 1,200 hand-coded documents (coded in an external application). We thought that the basic setup should be something like:

  1. Import the 1200 manual codings.
  2. Check performance of the classification with textcat.batch-train and/or textcat.train-curve.
  3. Assuming the model isn't good enough, add more data with textcat.teach from unannotated data.
  4. Repeat steps 2 and 3 until happy.

Question 1: Does this sound like the right way to use prodigy?

Question 2: We tried doing steps 1 and 2, but the performance is immediately ~100%. It seems like the model tries to predict accept/reject rather than our issue categories. See prodigy.sh · GitHub for the code we used.

Are we using the wrong commands?

Question 3 (probably related to 2): we are inputting the data like so:

{"text":"Podcast van 28 februari[...] vermissing, VVD","label":"wonen","meta": {"":"1","id":"188159157","medium":"1almere"}}

Is that the correct format, given that the target label is 'wonen'?
(never mind the silly "": "1", which are R rownames, but they don't seem to cause the problem)

Question 4: We also have a dictionary of terms for each issue, and a structural topic model trained with topics that correspond (somewhat) with the target issues identified. Does it make sense to somehow input these into the initial model as well, and how would we do this?

Question 5: Do we need to specify the spacy model ("nl"?) and/or indicate what we think are good features?

Yay, that's nice to hear! Your workflow looks really good so far, so definitely keep us updated about the results. Answers below:

Yes, that sounds like a good plan. How well this will work obviously depends on the data you have, but being able to pre-train a model is always nice, since you won't have to deal with the cold start problem.

Yes, the problem in your case is that you don't have any negative examples – and Prodigy is optimised to train from binary data and sparse annotations. So the model here simply learned that "every label is correct", which is true – but obviously not generalisable.
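
For example, a negative example in Prodigy's binary format is simply an annotation with "answer": "reject" – roughly like this (the label and text here are just illustrative):

{"text": "Podcast van 28 februari [...]", "label": "wonen", "answer": "reject"}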

One solution would be to add negative examples, e.g. by swapping out the labels. But you'll probably find it more efficient to just use spaCy directly – here's a simple code example. (In spaCy v2.0, all components share the same training API, so you can also take inspiration from the other examples.) Once you have a pre-trained model that predicts something, you can load it with textcat.teach and keep improving it on new data.
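
The rough shape of that pre-training script could look like this – a sketch, assuming train_texts, train_labels and labelset come from your own data loading code, and with a placeholder output path:

    import random
    import spacy
    from spacy.util import minibatch

    # train_texts, train_labels and labelset are assumed to come from your own
    # data loading code: the article texts, the manually coded label for each
    # article, and the full set of issue labels
    nlp = spacy.blank('nl')
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat)
    for label in labelset:
        textcat.add_label(label)

    # one positive category per text, every other category marked as False
    train_data = [(text, {'cats': {l: l == label for l in labelset}})
                  for text, label in zip(train_texts, train_labels)]

    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(train_data)
        losses = {}
        for batch in minibatch(train_data, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        print(i, losses)

    nlp.to_disk('/path/to/textcat_model')  # placeholder output directory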

Yes, that's all correct. We usually write all our labels in caps, e.g. WONEN, but this is only a stylistic thing and doesn't actually matter.

The terms dictionary could be very useful to bootstrap more training data and select examples from very large corpora. The textcat.teach recipe supports a --patterns argument that can point to a JSONL file of patterns that look like this:

{"label": "GERMANY", "pattern": [{"lower": "berlin"}]}
{"label": "USA", "pattern": "New York"}

The patterns can either be a list of dictionaries, with one dictionary describing a token and its attributes (just like the patterns for spaCy's rule-based Matcher), or exact strings. Using the patterns, you can give examples of words that are likely indicators of a category (e.g. texts including "berlin" are likely about Germany). You may come across false positives, too – but this is good, because you also want your model to learn about those cases.
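
An annotation session with patterns could then be started roughly like this (the dataset name, source file and patterns file are placeholders for your own):

    prodigy textcat.teach issue_dataset nl_core_news_sm unannotated_texts.jsonl --label WONEN --patterns issue_patterns.jsonl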

If you're working with Dutch text, you probably want to start off with the small Dutch model, nl_core_news_sm. If you don't care about the other components (tagger, parser, NER) and only want to train the text classifier, you can also just save out a "blank" model instead:

import spacy

nlp = spacy.blank('nl')
nlp.to_disk('/path/to/blank_nl_model')  # save it out; the path is just a placeholder

Prodigy's annotation recipes can take the name of a model package or a path to a model directory – so you can simply pass in the directory containing the pre-trained model that you saved out.

Thanks Ines for your quick reply! We'll continue and keep you posted.


@Ines, thanks again for your reply!

I’ve managed to get a textcat model trained in spacy, so the next step will be loading this into prodigy and starting the active learning process, but I’m not quite sure if I did everything right. So if anyone could have a short peek at my code or questions below it would be much appreciated.

Code is at https://gist.github.com/vanatteveldt/1a2aa9c470ca64f8bf5969a83c28d16a.

My data consists of 1241 newspaper articles coded into 17 issue categories (1 category per article), with 4 categories having <= 30 examples and 5 categories having >= 100.

The spacy model converges on an accuracy of 45% on test data (65% on train), reaching 40% accuracy after around 130 iterations.

I didn’t expect performance to be much better given the low number of training examples relative to the number of classes. However, I do have some questions:

  1. I now create a gold set per article consisting of {'cats': {'LABEL': True, 'OTHERLABEL': False, ...}}, i.e. I set a positive label for the manual code and a negative label for all others (see the sketch after this list). Is that correct?
  2. Is there a way to tell the model that there is a single class per document?
  3. Are there any hyperparameters, feature cleaning, NN design etc. choices that I should look at, or should the defaults be good for this use case?
  4. Is it expected that performance is on par with a simple SVM model?
    [FYI, code and results at https://gist.github.com/vanatteveldt/3bf403c8f3c1f2195f8eb7ca22f33b6c]
  5. Is it normal/expected that it takes >100 iterations to converge?
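
For concreteness, a sketch of the kind of helper I mean for question 1 (labelset being the list of all 17 category names):

    def get_cats(label, labelset):
        # mark the manually coded label as True and every other label as False
        return {'cats': {l: l == label for l in labelset}}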

Sorry for the many questions! I’ve trained a lot of ‘regular’ text classification models, but as this is my first venture into training my own spacy models and I’m not quite sure about all the design decisions, I feel a bit uncertain about these issues…

(edit: for some reason it says this is hidden as spam, but I can’t really understand what’s spammy about it, and I don’t see any comments for why it would be?)

Okay so, there’s definitely a problem with your code: you’re not minibatching the inputs; you’re calling “update” on the whole data. The nlp.update method performs a single gradient-descent step – so you’re only making 200 updates, with each update estimated on the whole dataset. You need to change this to:

from spacy.util import minibatch

for i in range(10):
    losses = {}
    annotations = [get_cats(label, labelset) for label in train_labels]
    dataset = zip(train_texts, annotations)
    for batch in minibatch(dataset, size=8):
        batch_texts, batch_annots = zip(*batch)
        nlp.update(batch_texts, batch_annots, sgd=optimizer, drop=0.2,
                   losses=losses)

A more general point as well: you should probably try to benchmark against a bag-of-words model, probably using something like scikit-learn. I’ve been wondering what we can do to make this more transparent.

Let’s say the neural net model gets you 45%. Then you run a bigram bag-of-words model, and come up with 72% using scikit-learn, with the normal stop-words removal, tf-idf weighting, etc. So you run more hyper-parameter search, and then the neural net gets 73%. More hyper-parameter tuning on the scikit-learn model gets you 78%, while the best you can come up with from spaCy is 77.8%.

If the bigram model instead got you 40%, you’d probably end the hyper-parameter search from the neural network sooner, as you’d rightly conclude the problem’s probably hard and it’ll be tough to do much better. My point here is that one of the problems with hyper-parameter search is not knowing what you “ought” to be getting, which makes it hard to know whether you’re in roughly the right sort of hyper-parameter space.

If you know you’re 30% behind where you could be, that really changes what sort of decisions you explore. More epochs won’t get you 30% more accuracy. Changing the batch size or the learning rate might, though. So, life is a lot easier if you have a comparison point. Running some experiments with a simpler model is really good for framing the problem.
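
As a rough illustration, a bigram bag-of-words baseline in scikit-learn can be as small as this sketch (train_texts/train_labels and dev_texts/dev_labels stand in for your own split):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # unigrams + bigrams, tf-idf weighted; plug in a Dutch stop-word list if you have one
    baseline = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        LinearSVC())
    baseline.fit(train_texts, train_labels)
    print('baseline accuracy:', baseline.score(dev_texts, dev_labels))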

Excellent, thanks for the reply.

I’ve updated the code to use the minibatching: https://gist.github.com/vanatteveldt/cf5d776b17dc84b6e4c8a6fe3c785d88

It gets around 65% accuracy on the held-out data and reaches that after 10 iterations.

I’ll run a baseline model as well, see how that performs. Wouldn’t you expect the spacy model to outperform a simple bag-of-words baseline?

It depends. Bigram bag-of-words models do pretty well at topic-based classification of news. In topic classification, there's not much advantage to capturing things like sentence structure. Negation doesn't matter, and the topics are very seldom going to be determined by things like metaphor or allusion. Word presence and absence is really the main thing, and news tends to use the same words for the same topics, to orient readers. So it's a quite ideal case for the bag-of-words models.

Beating bag-of-words models on these tasks is still possible, if the hyper-parameters are tuned well. But if the hyper-parameters are tuned poorly, the model might do much worse than the bag-of-words. So it's important to know how a bag-of-words model does, to know whether you're way behind the accuracy you should be getting.

Cool, thanks.

Is there documentation on which hyperparameters can be tuned and what sensible ranges might be? The code contains the drop and size parameters, but there are probably also parameters for the layers etc. of the neural net?

Edit: I found your answer here [Imbalanced classes in a multiclass textcat leads to completely biased predictions] which subclasses the textcat model to change the topology. Is that what I should be looking at? I guess I could also decide to add a softmax to enforce one-topic-per-document (assuming that’s what we want), but that probably means changing / copy-pasting _ml.build_text_classifier, right?

Hi guys,

This looks like the correct place to ask my question 🙂

I have a question regarding the minibatching that is being discussed in the latest posts of this thread and that is also used in the example train_textcat.py.

The main training loop looks like this:

    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)

I was wondering: why do you consume all batches of the minibatch in one iteration of the main loop, instead of consuming one batch per iteration? The following code should explain what I mean.


    # batch up the examples using spaCy's minibatch
    batches = minibatch(train_data, size=compounding(4., 32., 1.001))
    # consume one batch per iteration of the main loop
    for i, batch in zip(range(n_iter), batches):
        losses = {}
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)

Thanks in advance!

Your Environment

  • spaCy version: 2.0.12
  • Platform: Windows-10-10.0.14393-SP0
  • Python version: 3.6.5
  • Models: de

@agh92 Replied to your spaCy thread: https://github.com/explosion/spaCy/issues/3151

@honnibal Thanks a lot!