Evaluating a text classification model

How exactly are you checking and calculating this? What do you count as "correct"? And how did you evaluate the model after training – did you use a dedicated evaluation set, or did you just let the training process hold back some portion of the data?

The thing is, when you evaluate a model, the score reflects how accurate the model's predictions are on the given evaluation set. If the evaluation set is good and representative, those results should generalise to other, similar data: if you ran the same evaluation on a comparable dataset, you'd expect to see similar scores. If the evaluation set isn't representative, the scores it produces are much less reliable.
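To make that concrete, here's a minimal sketch of what "accuracy on an evaluation set" usually means. It assumes a hypothetical `model.predict(text)` method that returns a single label and an evaluation set of `(text, gold_label)` pairs – adapt the names to whatever your setup actually looks like.

```python
def evaluate(model, eval_examples):
    """Return the fraction of examples whose predicted label matches the gold label."""
    correct = 0
    for text, gold_label in eval_examples:
        predicted = model.predict(text)  # hypothetical prediction API
        if predicted == gold_label:      # "correct" = exact label match
            correct += 1
    return correct / len(eval_examples)

# Running this twice on the same fixed evaluation set should give the exact
# same score, since nothing here is random.
```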

If you're serious about evaluating and drawing conclusions from the results beyond "looks like the model is learning", you probably want to set up your own separate evaluation script and consistently work with the same representative evaluation set – see the sketch below. This lets you test the model in a way that gives you meaningful insights and can help you find problems in the model and/or the data.
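For example, a standalone evaluation script could look roughly like this. It assumes the evaluation set lives in a JSONL file with "text" and "label" fields, and `load_model` / `model.predict` stand in for however you load and run your own model – those are placeholders, not a specific library API. It uses scikit-learn's `classification_report` to break the results down per label, which is often where problems in the model or the data show up.

```python
# evaluate.py -- sketch of a standalone evaluation script
import json

from sklearn.metrics import accuracy_score, classification_report


def load_eval_set(path):
    """Read one JSON object per line, e.g. {"text": "...", "label": "..."}."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f]


def main():
    eval_examples = load_eval_set("eval.jsonl")  # always the same, fixed file
    model = load_model("model-best")             # placeholder: your own loading code
    gold = [eg["label"] for eg in eval_examples]
    pred = [model.predict(eg["text"]) for eg in eval_examples]
    print("Accuracy:", accuracy_score(gold, pred))
    # Per-label precision/recall/F1 often points at specific problems,
    # e.g. one label the model almost never predicts correctly.
    print(classification_report(gold, pred))


if __name__ == "__main__":
    main()
```

Keeping the evaluation data in its own file (and never touching it during training) means every model you train is scored against the same benchmark, so the numbers stay comparable over time.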