I found myself needing a version of the textcat.batch-train recipe that chooses the best model based on F-score rather than accuracy. So I changed the code in textcat.py in the obvious way: I introduced another argument to the recipe and used it wherever the current code refers to "accuracy".
It works. Happy to offer it back if desired.
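For anyone curious, the idea can be sketched roughly like this. Note this is illustrative only: the `epoch_stats` dicts and the `eval_metric` argument are my own stand-ins, not Prodigy's actual internals.

```python
# Sketch: choose the best epoch by a configurable metric instead of
# hard-coding accuracy. The stats dicts and metric names here are
# hypothetical, not Prodigy's real data structures.

def pick_best(epoch_stats, eval_metric="fscore"):
    """Return the index of the epoch that scored highest on the
    chosen evaluation metric ('accuracy' or 'fscore')."""
    best_idx, best_score = 0, float("-inf")
    for i, stats in enumerate(epoch_stats):
        if stats[eval_metric] > best_score:
            best_idx, best_score = i, stats[eval_metric]
    return best_idx

epochs = [
    {"accuracy": 0.94, "fscore": 0.10},  # degenerate but "accurate" model
    {"accuracy": 0.91, "fscore": 0.42},  # slightly less accurate, much better F
]
assert pick_best(epochs, "accuracy") == 0
assert pick_best(epochs, "fscore") == 1
```

With an unbalanced evaluation set, the two metrics can pick different epochs, which is exactly the behaviour described below.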

Optimal accuracy does not align with optimal F-score. I THINK this is happening because the eval dataset is much more unbalanced than the training set. This is probably something better fixed by changing the evaluation dataset to be more sensible, but it's a shared task so I didn't do that.

Thanks! We’re in the process of publishing the recipes on Github. We’re just figuring out the best process to build them into the Prodigy wheels once they’re in a separate repo. Once they’re published, pull requests will be very welcome!

About the accuracy vs F-score: This is a topic that makes me feel dumb every now and again, because it seems like it should be quite obvious, but then I find myself scratching my head.

If the model is constrained to output one class prediction per instance, I think accuracy should be the same as micro-averaged F1, right? However, this is obviously not true if we let the model predict multiple classes per instance, which the default spaCy text classification model is allowed to do.
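To convince myself of the single-label case: summed over all labels, the false positives and false negatives are both just the misclassified instances, so micro precision, micro recall, and accuracy all coincide. A small self-contained check (my own helper, nothing spaCy-specific):

```python
import math

def micro_f1(gold, pred):
    """Micro-averaged F1: pool true positives, false positives and
    false negatives across all labels before computing P, R, and F."""
    labels = set(gold) | set(pred)
    tp = fp = fn = 0
    for label in labels:
        tp += sum(g == p == label for g, p in zip(gold, pred))
        fp += sum(p == label and g != label for g, p in zip(gold, pred))
        fn += sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["A", "B", "C", "A", "B"]
pred = ["A", "B", "A", "A", "C"]  # one prediction per instance
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
# With exactly one label per instance, every error is counted once as
# a false positive and once as a false negative, so micro-F1 == accuracy.
assert math.isclose(micro_f1(gold, pred), accuracy)
```

Once the model may assign multiple labels (or none) per instance, the pooled FP and FN counts decouple from the error count and the equivalence breaks down.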

I’d be interested in having that recipe if you’re sharing! One of my textcat models is for rare categories, and the F-score just keeps dropping as the accuracy improves…

Dear Chris,
I’m also experiencing some difficulties training a binary classification model with a very small number of examples assigned to “accept” (~5%). Searching for the optimal accuracy tends to produce a classifier whose output is systematically “reject”, with a very high accuracy score. I assume optimizing using the F-score would be more fruitful? Could you please share the changes you made in textcat.py?
Or is there a more generic way to do so?
Thanks a lot!
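A toy illustration of why accuracy misleads in this situation (plain Python, nothing to do with Prodigy's actual code): with only ~5% "accept" examples, a classifier that always says "reject" scores 95% accuracy while its F-score on "accept" is zero.

```python
def accept_f1(gold, pred):
    """F1 for the 'accept' class in a binary accept/reject task."""
    tp = sum(g == p == "accept" for g, p in zip(gold, pred))
    fp = sum(p == "accept" and g != "accept" for g, p in zip(gold, pred))
    fn = sum(g == "accept" and p != "accept" for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # no true positives: precision/recall both zero
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = ["accept"] * 5 + ["reject"] * 95   # ~5% positives, as described above
pred = ["reject"] * 100                    # the degenerate "always reject" model
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
assert accuracy == 0.95      # looks great...
assert accept_f1(gold, pred) == 0.0  # ...but the model is useless
```

Selecting the model by F-score on the rare class, as discussed above, avoids rewarding this degenerate solution.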