Evaluating a text classification model

MBSanchez · September 20, 2019, 10:05am

Hi,

I am trying to classify text into 10 categories. I have example data that I am feeding into Prodigy and I am training a model with that data. The process I am following is:

Creating around 25k accept and reject examples.
Creating a dataset with those examples using db-in.
Using the textcat.batch-train recipe to get the model I am going to use for predicting.

I get 0.91 for both accuracy and F-score (5162 correct, 509 incorrect).

But when I use the resulting model to predict categories for unseen and seen text (just as a check), it predicts correctly for only around 80% of my seen data, even though that is the data I trained and tested the model with. How can there be such a difference? Shouldn't it be close to 90%?

How can I improve my model?

Thanks!

ines · September 21, 2019, 9:51am

How exactly are you checking and calculating this? What do you count as "correct"? And how did you evaluate the model after training – did you use a dedicated evaluation set or just let it hold back some portion of the data?

The thing is, when you evaluate a model, the score reflects how accurate the model's predictions are on the given evaluation set. If the evaluation set is good and representative, those results should ideally generalise to other similar datasets. So if you ran the same evaluation again on that data, you should get similar results. If the evaluation set is not representative, it may produce less reliable results.

If you're serious about evaluating and drawing conclusions from the results beyond "looks like the model is learning", you probably want to set up your own, separate evaluation scripts and consistently work with the same representative evaluation set. This lets you test the model in a way that gives you meaningful insights and can help you find problems in the model and/or the data.
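
One way to keep the evaluation set stable across runs is to decide membership from a hash of the text rather than from a random shuffle. The following is only a minimal sketch: the helper names (`is_eval_example`, `split_examples`) and the toy examples are made up for illustration, and the example dicts only assume a `"text"` key like the one in Prodigy's annotation format.

```python
import hashlib


def is_eval_example(text, eval_fraction=0.2):
    """Return True if this text belongs to the evaluation set.

    The decision depends only on the text itself, so the same text
    always ends up on the same side of the split.
    """
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    bucket = int(digest, 16) % 1000  # stable bucket in the range 0..999
    return bucket < eval_fraction * 1000


def split_examples(examples, eval_fraction=0.2):
    """Split examples into (train, evaluation) lists deterministically."""
    train, evaluation = [], []
    for eg in examples:
        if is_eval_example(eg["text"], eval_fraction):
            evaluation.append(eg)
        else:
            train.append(eg)
    return train, evaluation


# Hypothetical toy data standing in for the exported annotations.
examples = [
    {"text": f"document number {i}", "label": "HEALTH", "answer": "accept"}
    for i in range(1000)
]
train, evaluation = split_examples(examples)
print(len(train), len(evaluation))
```

Because the split depends only on the text, adding new annotations later does not move existing examples between the training and evaluation sets, so scores from different training runs stay comparable.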

MBSanchez · September 23, 2019, 5:17pm

Hi Ines!

I am calculating this based on the labelled data that I have, which is the data I am training the model with.

I have around 25k examples of labelled data that I use to train and test my model. When I train the model with textcat.batch-train on this data, the data is split into a training set (80%) and a test set (20%). The accuracy of this model is 0.91.

Afterwards, I use the model produced by textcat.batch-train to predict the category for all documents (both the ones I used for training and testing the model and the ones whose category I don't know). If I check only the ones I used for training the model and compare the "real" label with the label predicted by the model, the accuracy is lower by about 0.10. If the model was able to reach 0.91 accuracy before, I don't understand why it only reaches around 0.80 on the same data.

Anyway, I am going to create an evaluation set, that is for sure, but I don't know if I am mixing up some concepts here.

Thanks!

ines · September 24, 2019, 8:07am

I think a dedicated evaluation set will definitely make this clearer and make it easier to compare the results. The main points here are:

A random split is okay to get a rough idea of whether something is working, but it can easily lead to inconsistent results.
When you're running your own evaluation, make sure you're accounting for whether the labels are exclusive. By default, textcat.batch-train assumes that multiple labels can apply (unless you set --exclusive). If you're later evaluating the model assuming that labels are exclusive, you may see significantly worse results.
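
To illustrate the second point, here is a small sketch of how the same predictions can score very differently depending on whether labels are treated as exclusive. It assumes the model's scores are available as a dict of label to score per document (as in spaCy's `doc.cats`). The function names, the `SPORTS` label, the scores and the 0.5 threshold are all made up for illustration, and this is not necessarily how textcat.batch-train computes its own score.

```python
def exclusive_accuracy(predictions, gold):
    """Treat the single top-scoring label as the prediction.

    A document counts as correct only if that one label is exactly
    the set of gold labels.
    """
    correct = 0
    for scores, labels in zip(predictions, gold):
        top_label = max(scores, key=scores.get)
        if {top_label} == labels:
            correct += 1
    return correct / len(gold)


def per_label_accuracy(predictions, gold, threshold=0.5):
    """Treat every (document, label) pair as its own yes/no decision."""
    correct = 0
    total = 0
    for scores, labels in zip(predictions, gold):
        for label, score in scores.items():
            predicted = score >= threshold
            if predicted == (label in labels):
                correct += 1
            total += 1
    return correct / total


# Hypothetical scores for three documents, with their gold label sets.
predictions = [
    {"HEALTH": 0.99, "ENERGY": 0.87, "SPORTS": 0.02},
    {"HEALTH": 0.10, "ENERGY": 0.95, "SPORTS": 0.05},
    {"HEALTH": 0.20, "ENERGY": 0.30, "SPORTS": 0.10},
]
gold = [{"HEALTH", "ENERGY"}, {"ENERGY"}, set()]

print(f"exclusive: {exclusive_accuracy(predictions, gold):.2f}")
print(f"per label: {per_label_accuracy(predictions, gold):.2f}")
```

On this toy data the top-label comparison gets only 1 of 3 documents right, while the per-label comparison gets all 9 decisions right. A document with two correct labels, or with none, can never be scored as correct when only the top label is used.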

On a related note, you might also want to check out the text classification evaluation coming to the next version of spaCy – code here: https://github.com/explosion/spaCy/pull/4226

MBSanchez · September 24, 2019, 10:11am

Yes, I will definitely use a dedicated evaluation set.

In fact, I don't want the labels to be exclusive, so maybe the problem is that I am not evaluating the model properly. If the result is Health 0.99, Energy 0.87 and so on, I only take the first label, because I don't know which threshold I should use to get more than one label. More than 0.8? More than 0.9? I don't mind getting more than one label, that is totally correct, but I don't know when I should take two labels, or just one, or three...

Thanks again for the time you spend answering my questions!
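
One common way to approach the threshold question is to pick the cutoff empirically: try several candidate thresholds on a held-out evaluation set and keep the one with the best F-score for each label. The sketch below shows this for a single label. The function names, candidate thresholds and scores are invented for illustration, and it assumes per-label scores and gold yes/no answers are already available as parallel lists.

```python
def f_score(scores, gold, threshold):
    """F1 for one label when predicting 'yes' for score >= threshold."""
    tp = fp = fn = 0
    for score, is_gold in zip(scores, gold):
        predicted = score >= threshold
        if predicted and is_gold:
            tp += 1
        elif predicted and not is_gold:
            fp += 1
        elif not predicted and is_gold:
            fn += 1
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, gold, candidates=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return the candidate threshold with the highest F1 on this data."""
    return max(candidates, key=lambda t: f_score(scores, gold, t))


# Hypothetical "Energy" scores on six evaluation documents,
# and whether "Energy" really applies to each of them.
energy_scores = [0.87, 0.95, 0.30, 0.75, 0.65, 0.10]
energy_gold = [True, True, False, True, False, False]

threshold = best_threshold(energy_scores, energy_gold)
print(threshold, f_score(energy_scores, energy_gold, threshold))
```

Running the same search separately for each of the 10 categories gives one threshold per label. The thresholds should be chosen on evaluation data that was not used for training, otherwise they will tend to look better than they are.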
