I'm struggling to print model accuracy when running prodigy train textcat. I would have imagined this should be simple to do, but --label-stats only returns precision, recall, and F1. Is there a simple way to print accuracy as well? I understand the issues with accuracy -- especially when classes are imbalanced -- but disciplinary traditions in the field in which I'm working demand accuracy as the standard evaluation metric. Is there an easy way to include it in the output of prodigy train?
hi @cbjrobertson!
Thanks for your question.
Prodigy's evaluation uses spaCy's scorer (see code); prodigy train is just a wrapper around that. Unfortunately, I don't see accuracy available there. The scorer offers many of the more common evaluation metrics -- precision, recall, F1 (micro and macro), and AUC -- but to my knowledge it doesn't report raw accuracy, precisely because of its distortion effects, especially on imbalanced data.
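For illustration, here is a minimal sketch of the kind of scores spaCy's evaluation returns for a textcat pipeline. The model folder, the example texts, and the labels are all placeholders, and the exact score keys can vary between spaCy versions:

```python
import spacy
from spacy.training import Example

# Placeholder: path to a trained textcat pipeline
nlp = spacy.load("my_model_folder")

# Placeholder held-out data as (text, annotations) pairs
eval_data = [
    ("this product is great", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    ("terrible experience", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in eval_data]

# nlp.evaluate runs spaCy's Scorer under the hood. The returned dict has
# keys like "cats_macro_f", "cats_macro_auc", "cats_f_per_type" --
# but no plain accuracy entry.
scores = nlp.evaluate(examples)
print(scores.get("cats_macro_f"))
print(scores.get("cats_f_per_type"))
```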
One idea: if you have trained your own model (e.g., run prodigy train my_model_folder --textcat train_data,eval:eval_data) and your model is now in the folder my_model_folder, you can find the meta.json file, which includes the full scorer output, including performance by category. If you used a dedicated holdout evaluation dataset (like eval_data) and you know your counts by class, you can likely back out the raw accuracy.
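As a rough sketch of what "backing out" accuracy could look like for a single-label (mutually exclusive) problem: per-label recall times the number of gold examples for that label gives the number of correct predictions for that label, so summing over labels and dividing by the total yields accuracy. The cats_f_per_type key, the file layout, and the counts below are assumptions you'd need to adapt to your own model and spaCy version:

```python
import json
from pathlib import Path

# Placeholder: meta.json written into the trained model folder
meta = json.loads(Path("my_model_folder/meta.json").read_text())

# Per-label precision/recall/F1 from spaCy's scorer (key name may differ by version)
per_label = meta["performance"]["cats_f_per_type"]

# Placeholder: number of gold examples per label in your holdout eval_data
support = {"POSITIVE": 120, "NEGATIVE": 80}

# recall * support = correctly predicted examples for that label,
# so the sum over labels divided by the total is raw accuracy.
correct = sum(per_label[label]["r"] * n for label, n in support.items())
total = sum(support.values())
print(f"Estimated accuracy: {correct / total:.3f}")
```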
Otherwise, you will likely need to write your own custom script. We had a similar request for a confusion matrix, but that was for ner, not classification.
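If you do go the custom-script route, a confusion matrix for a textcat model is only a few lines. This is a sketch, not a Prodigy recipe: my_model_folder and the gold-labelled eval_data list are placeholders, and it assumes mutually exclusive labels:

```python
import spacy
from collections import Counter

nlp = spacy.load("my_model_folder")  # placeholder path

# Placeholder gold data as (text, label) pairs, e.g. exported from your eval dataset
eval_data = [
    ("this product is great", "POSITIVE"),
    ("terrible experience", "NEGATIVE"),
]

# Count (gold, predicted) pairs using the highest-scoring category per doc
confusion = Counter()
for text, gold in eval_data:
    doc = nlp(text)
    pred = max(doc.cats, key=doc.cats.get)
    confusion[(gold, pred)] += 1

# Print a simple gold-by-predicted table
labels = sorted({g for g, _ in confusion} | {p for _, p in confusion})
print("gold\\pred", *labels)
for g in labels:
    print(g, *[confusion[(g, p)] for p in labels])

# Raw accuracy is the diagonal over the total
accuracy = sum(v for (g, p), v in confusion.items() if g == p) / sum(confusion.values())
print(f"Accuracy: {accuracy:.3f}")
```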
Another idea would be to bypass Prodigy/spaCy's evaluation metrics entirely. After you train a model, take your evaluation dataset, score it with your model to get predicted probabilities, then set your own thresholds and calculate accuracy by hand. This may also help convince others, since they could see the calculations laid out in a spreadsheet.
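As a concrete sketch of that manual route (the model path, label names, 0.5 threshold, and output filename are all placeholders to replace with your own):

```python
import csv
import spacy

nlp = spacy.load("my_model_folder")  # placeholder path

# Placeholder gold data as (text, label) pairs
eval_data = [
    ("this product is great", "POSITIVE"),
    ("terrible experience", "NEGATIVE"),
]
THRESHOLD = 0.5  # pick your own decision threshold

rows, correct = [], 0
for text, gold in eval_data:
    doc = nlp(text)
    score = doc.cats.get("POSITIVE", 0.0)  # predicted probability for one label
    pred = "POSITIVE" if score >= THRESHOLD else "NEGATIVE"
    correct += pred == gold
    rows.append({"text": text, "gold": gold, "pred": pred, "score": round(score, 3)})

print(f"Accuracy at threshold {THRESHOLD}: {correct / len(eval_data):.3f}")

# Dump per-example predictions so the calculation can be checked in a spreadsheet
with open("eval_predictions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "gold", "pred", "score"])
    writer.writeheader()
    writer.writerows(rows)
```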
I'm sorry that there isn't (to my knowledge) a simpler way of doing this -- but I think spaCy's scorer was deliberately designed without raw accuracy to avoid it being misinterpreted for multiclass models, for the reasons you mentioned.
This maybe doesn't answer your question directly, but you may also find @koaning's spacy-report package helpful for textcat models: it lets you try different thresholds and visualize their effects on precision/recall by label.
Thanks for that useful answer; it echoes my own conclusions (I ended up running predictions on the hold-out data and calculating a confusion matrix by hand). Thanks for the pointer to spacy-report, I'll definitely check it out.