textcat.eval use and Prodigy's evaluation workflow?

I’ve been loving Prodigy so far, especially its integration with spaCy.

I’m currently training a model to do some tedious text classification. So far, so good. However, one aspect of the workflow is quite mysterious to me: the textcat.eval recipe.

Say I successfully batch-train a model (exported as /tmp/model), setting aside a share of my database for evaluation (say, evaluation.jsonl). What precisely does textcat.eval do, and how do we use it as part of our workflow? Presumably, we feed in our trained model (/tmp/model) and use our evaluation dataset (evaluation.jsonl). But I’m unsure what it specifically does to the model and where it fits into a project.


Thanks a lot, that’s nice to hear! :blush:

The textcat.eval recipe (see here for example usage and output) is mostly useful to create an evaluation set in “real time” and see how your model is performing on unseen text.
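
For context, an invocation looks roughly like the following. The dataset name `eval_session` and the label `MY_LABEL` are made up for illustration, and the exact arguments can vary between Prodigy versions, so check `prodigy textcat.eval --help` for your install:

```shell
# Stream evaluation.jsonl through the trained model at /tmp/model and
# record your accept/reject decisions in a new dataset (name illustrative)
prodigy textcat.eval eval_session /tmp/model evaluation.jsonl --label MY_LABEL
```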

For example, let’s say you’ve trained or updated a model and you want to see how it performs on new data. You can then use textcat.eval with your model and stream in the texts you want to test it on. The web app lets you click accept/reject on the model’s predictions, and when you exit the server, you’ll see a detailed breakdown of how the model performed, compared to the “correct” answers (i.e. your decisions):

MODEL   USER   COUNT
accept  accept    47   # both you and the model said yes
accept  reject     7   # model said yes, you said no
reject  reject    95   # both you and the model said no
reject  accept     7   # model said no, you said yes 

Correct     142        # total correct predictions
Incorrect    14        # total incorrect predictions

Baseline      0.65     # baseline to beat (score if all answers were the same)
Precision     0.87
Recall        0.87
F-score       0.87

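The summary stats above follow directly from the four confusion counts. This is a sketch of that arithmetic, not Prodigy’s actual code; the `summarize` helper is made up, but the metric definitions (precision/recall over the “accept” class, majority-class baseline) reproduce the numbers shown:

```python
def summarize(model_yes_user_yes, model_yes_user_no,
              model_no_user_no, model_no_user_yes):
    """Recompute the evaluation summary from the four confusion counts."""
    tp = model_yes_user_yes   # both model and annotator said yes
    fp = model_yes_user_no    # model said yes, annotator said no
    tn = model_no_user_no     # both said no
    fn = model_no_user_yes    # model said no, annotator said yes
    total = tp + fp + tn + fn
    correct = tp + tn
    # Baseline: the score you'd get by always giving the majority answer
    user_yes = tp + fn
    baseline = max(user_yes, total - user_yes) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return correct, total - correct, baseline, precision, recall, f_score

correct, incorrect, baseline, p, r, f = summarize(47, 7, 95, 7)
print(correct, incorrect)   # 142 14
print(round(baseline, 2), round(p, 2), round(r, 2), round(f, 2))
# 0.65 0.87 0.87 0.87
```
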
We think the recipe is especially useful as a developer tool, while you’re still working on the model and tweaking it. It may not replace your final evaluation process, but it’s a quick sanity check and a quick way of labelling evaluation data and evaluating the model at the same time.

(It also makes it easy to ask a colleague to do a quick evaluation run for you, if you’re worried that you’re not “strict” enough with your model :wink: All they have to do is click a few hundred times, and you’ll immediately have some numbers and at least a rough idea of whether you’re on the right track or not.)


Ah, thank you so much for the reply. This is helpful!

Just a very quick follow-up: this is merely for evaluation and sanity checks… textcat.eval isn’t updating the trained model that we feed into it, right?

That’s correct. The recipe won’t update the model in the loop — it only gets predictions on the incoming data and then compares them to your answers.