textcat.eval use and Prodigy's evaluation workflow?

Thanks a lot, that’s nice to hear! :blush:

The textcat.eval recipe (see here for example usage and output) is mostly useful for creating an evaluation set in “real time” and seeing how your model performs on unseen text.
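
To give a rough idea of what a call looks like, assuming you have a loadable spaCy model and a JSONL file of unseen texts: the dataset name, model, file and label below are just placeholders, and the exact arguments can differ between Prodigy versions, so check prodigy textcat.eval --help for the signature your install expects.

prodigy textcat.eval eval_set en_core_web_sm unseen_texts.jsonl --label MY_LABEL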

For example, let’s say you’ve trained or updated a model and you want to see how it performs on new data. You can then use textcat.eval with your model and stream in the texts you want to test it on. The web app lets you click accept/reject on the model’s predictions, and when you exit the server, you’ll see a detailed breakdown of how the model performed, compared to the “correct” answers (i.e. your decisions):

MODEL   USER   COUNT
accept  accept    47   # both you and the model said yes
accept  reject     7   # model said yes, you said no
reject  reject    95   # both you and the model said no
reject  accept     7   # model said no, you said yes 

Correct     142        # total correct predictions
Incorrect    14        # total incorrect predictions

Baseline      0.65     # baseline to beat (score if all answers were the same)
Precision     0.87
Recall        0.87
F-score       0.87
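
To make those summary numbers less magical: they follow directly from the four counts in the table, treating your accepts as the positive class. This is only a sketch of the arithmetic (not Prodigy's internal implementation), and it reads “baseline” as the score you'd get by always giving the majority answer:

# counts from the (model, user) table above
tp = 47   # model accept, user accept
fp = 7    # model accept, user reject
tn = 95   # model reject, user reject
fn = 7    # model reject, user accept

total = tp + fp + tn + fn                                 # 156
correct = tp + tn                                         # 142
baseline = max(tp + fn, fp + tn) / total                  # 0.65 (always answering "reject" here)
precision = tp / (tp + fp)                                # 0.87
recall = tp / (tp + fn)                                   # 0.87
f_score = 2 * precision * recall / (precision + recall)   # 0.87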

We think the recipe is especially useful as a developer tool while you’re still working on the model and tweaking it. It may not replace your final evaluation process, but it gives you a quick sanity check and lets you label evaluation data and evaluate the model at the same time.

(It also makes it easy to ask a colleague to do a quick evaluation run for you if you’re worried that you’re not being “strict” enough with your model :wink: All they have to do is click a few hundred times, and you’ll immediately have some numbers and at least a rough idea of whether you’re on the right track.)
