Evaluating a text classification model


I am trying to classify text into 10 categories. I have example data that I am feeding into Prodigy and I am training a model with that data. The process I am following is:

  • Creating around 25k accept and reject examples.

  • Creating a dataset from those examples with db-in.

  • Using the textcat.batch-train recipe to get the model I am going to use for predicting.

I get 0.91 for both accuracy and F-score (5162 correct and 509 incorrect).

But when I use the resulting model to predict categories for unseen and seen text (just for checking), it predicts correctly on only around 80% of my seen data, even though that is the data I have been training and testing my model with. How is it possible to have such a difference? Shouldn't it be closer to 90%?

How can I improve my model?


How exactly are you checking and calculating this? What do you count as "correct"? And how did you evaluate the model after training – did you use a dedicated evaluation set or just let it hold back some portion of the data?

The thing is, when you evaluate a model, the score reflects how accurate the model's predictions are on the given evaluation set. If the evaluation set is good and representative, those results should ideally generalise to other similar datasets. So if you ran the same evaluation again on that data, you should get similar results. If the evaluation set isn't representative, the results will be less reliable.

If you're serious about evaluating and drawing conclusions from the results beyond "looks like the model is learning", you probably want to set up your own, separate evaluation scripts and consistently work with the same representative evaluation set. This lets you test the model in a way that gives you meaningful insights and can help you find problems in the model and/or the data.
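As a rough illustration of what such a separate evaluation script could look like, here is a minimal sketch in plain Python. The names `fixed_split`, `evaluate` and `score_fn` are hypothetical and not part of Prodigy or spaCy. It assumes each example is a dict with "text", "label" and "answer" ("accept" or "reject") keys, and that `score_fn(text)` returns a dict mapping labels to scores. With a spaCy model that could be something like `lambda text: nlp(text).cats`, but that is an assumption about your setup.

```python
import random


def fixed_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed so the same examples are held back
    for evaluation on every run (instead of a new random split each time)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    # Returns (training examples, evaluation examples)
    return shuffled[n_eval:], shuffled[:n_eval]


def evaluate(examples, score_fn, threshold=0.5):
    """Accuracy over accept/reject decisions.

    score_fn is a hypothetical callable: text -> {label: score}.
    A prediction counts as correct if the score for the annotated label
    falls on the same side of the threshold as the human answer.
    """
    if not examples:
        return 0.0
    correct = 0
    for eg in examples:
        score = score_fn(eg["text"]).get(eg["label"], 0.0)
        predicted = "accept" if score >= threshold else "reject"
        correct += predicted == eg["answer"]
    return correct / len(examples)
```

Saving the evaluation portion to its own file and never training on it means every later model is scored on exactly the same examples.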

Hi Ines!

I am calculating this on the labelled data that I have, i.e. the data I am training the model with.

I have around 25k examples of labelled data that I use to train and test my model. When I train the model with textcat.batch-train on this data, the data is split into a training set (80%) and a test set (20%). The accuracy of this model is 0.91.

Afterwards I use the model produced by textcat.batch-train to predict the category for all documents (both the ones I used for training and testing the model and the ones whose category I don't know). If I look only at the ones I used for training the model and compare the "real" label with the label predicted by the model, the accuracy is lower by about 0.10. If the model was able to reach 0.91 accuracy before, I don't understand why it only gets around 0.80 on the same data.

Anyway, I am definitely going to create an evaluation set, but I am not sure whether I am mixing up some concepts here.


I think a dedicated evaluation set will definitely make this clearer and make the results easier to compare. I think the main points here are:

  • A random split is okay to get a rough idea of whether something is working, but it can easily lead to inconsistent results.
  • When you're running your own evaluation, make sure you're accounting for whether the labels are exclusive. By default, textcat.batch-train assumes that multiple labels can apply (unless you set --exclusive). If you're later evaluating the model assuming that labels are exclusive, you may see significantly worse results.
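To make the second point concrete, here is a small sketch showing how the same scores can give different accuracy figures depending on how "correct" is counted. The data and function names are made up for illustration, and this is not necessarily how textcat.batch-train computes its score internally. Each example carries the annotated label, the human answer, and a hypothetical dict of model scores for the text.

```python
def per_label_accuracy(examples, threshold=0.5):
    """Non-exclusive view: every (text, label) decision is judged on its own,
    so several labels can be right for the same text."""
    correct = 0
    for eg in examples:
        predicted_accept = eg["scores"][eg["label"]] >= threshold
        correct += predicted_accept == (eg["answer"] == "accept")
    return correct / len(examples)


def top1_accuracy(examples):
    """Exclusive view: only the single highest-scoring label counts,
    checked against the accepted annotations."""
    accepted = [eg for eg in examples if eg["answer"] == "accept"]
    correct = 0
    for eg in accepted:
        top_label = max(eg["scores"], key=eg["scores"].get)
        correct += top_label == eg["label"]
    return correct / len(accepted)


# Made-up scores: the first text genuinely belongs to two categories.
scores_a = {"Health": 0.99, "Energy": 0.87}
scores_b = {"Health": 0.10, "Energy": 0.95}
examples = [
    {"label": "Health", "answer": "accept", "scores": scores_a},
    {"label": "Energy", "answer": "accept", "scores": scores_a},
    {"label": "Energy", "answer": "accept", "scores": scores_b},
    {"label": "Health", "answer": "reject", "scores": scores_b},
]

print(per_label_accuracy(examples))  # every decision is on the right side of 0.5
print(top1_accuracy(examples))       # the second annotation is lost to the top label
```

In this toy data the per-label view gets all four decisions right, while the top-1 view misses the accepted "Energy" annotation on the first text because "Health" scores higher. A gap like that can appear even though the model itself has not changed.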

On a related note, you might also want to check out the text classification evaluation coming to the next version of spaCy – code here: https://github.com/explosion/spaCy/pull/4226

Yes, I will definitely use a dedicated evaluation set.

In fact, I don't want the labels to be exclusive, so maybe the problem is that I am not evaluating the model in the proper way. If the result is Health 0.99, Energy 0.87 and so on, I am only taking the first label, because I don't know which threshold to use to get more than one label. More than 0.8? More than 0.9? I don't mind getting more than one label, that is totally correct, but I don't know when I should take one label, two, or three...

Thanks again for the time you spend answering my questions!
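On the threshold question, one common approach (not something specific to Prodigy) is to stop guessing a cutoff and instead try several candidates on the dedicated evaluation set, keeping the one that gives the best F-score. The sketch below assumes each evaluation example holds the model's score for the annotated label plus the human accept/reject answer. The function names and the numbers are invented for illustration.

```python
def f_score(examples, threshold):
    """F1 over accept/reject decisions at a given score threshold."""
    tp = fp = fn = 0
    for eg in examples:
        predicted = eg["score"] >= threshold
        gold = eg["answer"] == "accept"
        tp += predicted and gold
        fp += predicted and not gold
        fn += gold and not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(examples, candidates):
    """Return the candidate threshold with the highest F-score."""
    return max(candidates, key=lambda t: f_score(examples, t))


# Invented evaluation data: model score for the annotated label + human answer.
eval_examples = [
    {"score": 0.99, "answer": "accept"},
    {"score": 0.87, "answer": "accept"},
    {"score": 0.75, "answer": "reject"},
    {"score": 0.60, "answer": "accept"},
    {"score": 0.40, "answer": "reject"},
    {"score": 0.20, "answer": "reject"},
]

for t in [0.5, 0.7, 0.8, 0.9]:
    print(t, round(f_score(eval_examples, t), 3))
print("best:", best_threshold(eval_examples, [0.5, 0.7, 0.8, 0.9]))
```

Any label whose score is at or above the chosen threshold is then returned, so a text can get one, two or three labels depending on its scores. If false positives are more costly than missed labels in your application, you could pick the threshold by precision instead.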