I have a text classification dataset where the ratio between positive and negative examples is 1:10. Are there any best practices for training set composition? Should I use all the examples? Randomly select negative examples to produce a 1:1 ratio? Something else?
I’d say for a 10:1 class balance you’ll probably be okay ignoring it and just training the model. If you do decide to under-sample the negative class, make sure you’re evaluating against a sample whose class balance matches the original data.
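To make that concrete, here's a minimal sketch with made-up toy data (the arrays and the 220-example split size are just for illustration): hold out the evaluation split first so it keeps the original class balance, then under-sample negatives only in the training split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels with a 1:10 positive:negative ratio (hypothetical data).
y = np.array([1] * 100 + [0] * 1000)
X = rng.normal(size=(len(y), 5))

# Hold out an evaluation split FIRST, so it keeps the original balance.
idx = rng.permutation(len(y))
test_idx, train_idx = idx[:220], idx[220:]

# Under-sample negatives in the training split only, down to 1:1.
train_pos = train_idx[y[train_idx] == 1]
train_neg = train_idx[y[train_idx] == 0]
train_neg_sampled = rng.choice(train_neg, size=len(train_pos), replace=False)
balanced_idx = rng.permutation(np.concatenate([train_pos, train_neg_sampled]))

X_train, y_train = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_train))      # equal positive and negative counts
print(np.bincount(y[test_idx]))  # roughly the original 1:10 imbalance
```

The key point is the order of operations: split before under-sampling, so the held-out set still reflects the distribution you'll see at inference time.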
One other thing to think about when you have imbalanced classes is whether your objective is also imbalanced. Sometimes you just don’t want to miss the positive examples. Other times you’re fine with missing true examples, as long as the false positive rate stays low. This can change the active learning priorities a bit, which can motivate some custom weighting. But if you really just want the highest accuracy, I would start off ignoring the class imbalance and seeing how you go.