Tips for training set composition

rpedela · January 19, 2019, 9:37pm

I have a text classification dataset where the ratio between positive and negative examples is 1:10. I am curious if there are any best practices for training set composition? Should I use all the examples? Randomly select negative examples to produce a 1:1 ratio? Something else?

honnibal · January 24, 2019, 12:55am

I’d say for a 10:1 class balance you’ll probably be okay ignoring it and just training the model. If you do decide to under-sample the negative class, make sure you’re evaluating against a sample whose class balance matches the original data. Otherwise your evaluation scores won’t reflect what the model will see in real use.
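To make the evaluation point concrete, here is a minimal sketch in plain Python. It assumes the examples are `(text, label)` pairs with a boolean label; the function name `split_then_undersample` and its parameters are made up for illustration and are not part of any library. The idea is to hold out the evaluation set first, keeping the original balance, and only then under-sample negatives in the training portion.

```python
import random


def split_then_undersample(examples, eval_fraction=0.2, neg_per_pos=1, seed=0):
    """Hypothetical helper: examples is a list of (text, is_positive) pairs."""
    rng = random.Random(seed)
    positives = [ex for ex in examples if ex[1]]
    negatives = [ex for ex in examples if not ex[1]]
    rng.shuffle(positives)
    rng.shuffle(negatives)

    # Hold out the same fraction of each class, so the evaluation set
    # keeps the class balance of the original data.
    n_pos_eval = int(len(positives) * eval_fraction)
    n_neg_eval = int(len(negatives) * eval_fraction)
    eval_set = positives[:n_pos_eval] + negatives[:n_neg_eval]

    # Under-sample negatives in the training portion only.
    train_pos = positives[n_pos_eval:]
    train_neg = negatives[n_neg_eval:]
    n_keep = min(len(train_neg), neg_per_pos * len(train_pos))
    train_set = train_pos + rng.sample(train_neg, n_keep)
    rng.shuffle(train_set)
    return train_set, eval_set


# Toy data with a 1:10 positive-to-negative ratio.
data = [(f"pos {i}", True) for i in range(100)]
data += [(f"neg {i}", False) for i in range(1000)]
train_set, eval_set = split_then_undersample(data)
```

With the toy data, the training set comes out 1:1 while the evaluation set stays at 1:10, so the scores you measure still describe the original distribution.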

One other thing to think about when you have imbalanced classes is whether your objective is also imbalanced. Sometimes you just don’t want to miss the positive examples. Other times you’re fine with missing some positive examples, you just want a low false positive rate. This can change the active learning priorities a bit, which can motivate some custom weighting. But if you really just want the highest accuracy, I would start off ignoring the class imbalance and see how you go.
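As a rough illustration of that trade-off, here is a small sketch that measures precision and recall at different decision thresholds. Moving the threshold is a swapped-in technique for showing the trade-off, not the custom weighting mentioned above, and the scores and labels are made-up toy values rather than output from a real model.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    # By convention here, report 1.0 when there is nothing to be wrong about.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy scores for the positive label, with gold labels (illustrative only).
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, the low threshold catches every positive at the cost of more false positives, and the high threshold does the reverse. Which end you want depends on whether missing positives or raising false alarms is worse for your use case.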
