Best practices for validation sets

I noticed that how I choose my validation set is having an impact on my final model. Are there any reasons you can think of that might cause this besides choosing the best model from the performance on the validation set?

Do you have any ideas on best practices for creating a good validation set? If I am working on an imbalanced binary dataset, should the validation set be split 50/50 or follow the data distribution?


You probably want to draw your validation set from the real data distribution — so no, I wouldn’t bias it. If you care more about accuracy on the positive examples, I would reflect that in the metric, e.g. use weighted F-score.
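To make the metric suggestion concrete, here is a minimal sketch. I’m assuming "weighted F-score" means F-beta, where beta > 1 counts recall on the positive class more heavily than precision. The function name `f_beta` and the toy labels are made up for illustration.

```python
def f_beta(y_true, y_pred, beta=1.0, positive=1):
    """F-beta for the positive class; beta > 1 weights recall above precision."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy imbalanced data (hypothetical): 2 positives among 10 examples,
# and the model finds only one of them.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f_beta(y_true, y_pred, beta=1.0)
f2 = f_beta(y_true, y_pred, beta=2.0)
print(f"accuracy={accuracy:.2f} f1={f1:.2f} f2={f2:.2f}")
```

Accuracy looks high here because the negatives dominate, while the F-scores show the missed positive. That is the reason to keep the validation set at the real distribution and put your preference into the metric instead.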

Some more general advice about evaluation data, to explain the logic:

  • When you have fewer than about 5000 training examples (i.e. very early in a project), make decisions based on the cross-validation splits in the batch-train recipe, or via the A/B evaluation.

  • After that, section off a number of examples to annotate for evaluation. Using the next month of data after your training set is often good. Try to make sure you’re not splitting up logical units of your data and having some land in the training set and some land in the evaluation set. For instance, don’t split up a document such that some of its sentences are in the training set, and others in the evaluation set. If you’re working with threaded comments, don’t split threads between training and evaluation. This will introduce a bias, because you’ll have unrealistically few unseen words in your evaluation data (word frequencies are ‘bursty’: if a word is mentioned in one comment in a thread, chances of it being mentioned elsewhere in the thread are much higher.)

  • Shuffle your chosen evaluation portion, and annotate a few hundred examples. If you’re doing NER, make sure you’re annotating them using the ner.make-gold or ner.manual recipes, as you want complete annotations, and you don’t want the tool to skip examples.

  • A few hundred evaluation examples allow you to reasonably discriminate between methods with, say, 60% accuracy and methods with, say, 80% accuracy. When you get results which are closer, e.g. between 70% and 73%, if you only have a few hundred examples, that 3% difference might mean only 10 different predictions. That’s not a reliable basis to conclude one method is better. So, in order to optimise your approach with a smaller step-size (i.e. to be able to stack up lots of small improvements), you need more evaluation data. At this point, go back and annotate more examples, ideally using the same methodology as before. It may be worth measuring your model’s performance on both the first data you created for evaluation and this new set. If the accuracies differ, your annotation practices might have changed. After all, you should have been drawing the examples randomly, so there shouldn’t be a meaningful difference between the two sets. You might want to go over them again and recheck the decisions to make them consistent.

  • Once you’re running more experiments, you’ll need test data as well as development data. Depending on how many experiments you’ve run or what you’ve decided, you might be able to just split your existing annotations into two groups. You want to do this randomly: the test and development data need to be drawn from the same distribution. All you’re trying to do here is avoid making decisions that happen to be good for one set of examples, but not for another. This is basically optimisation through random search, which isn’t a very powerful technique — so the risk of over-fitting the specific examples isn’t that significant.
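The point above about not splitting logical units can be sketched roughly like this. It’s a random split by group rather than the time-based split I suggested, and it’s plain Python, not a Prodigy feature. The function name `split_by_group`, the `thread_id` field and the toy examples are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_group(examples, group_key, eval_fraction=0.2, seed=0):
    """Split so that every group (thread, document, ...) lands on one side only.

    Note: eval_fraction is a fraction of groups, not of examples.
    """
    groups = defaultdict(list)
    for eg in examples:
        groups[eg[group_key]].append(eg)
    group_ids = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(group_ids)
    n_eval = max(1, round(len(group_ids) * eval_fraction))
    evaluation = [eg for gid in group_ids[:n_eval] for eg in groups[gid]]
    train = [eg for gid in group_ids[n_eval:] for eg in groups[gid]]
    return train, evaluation


# Hypothetical data: 50 comments spread over 10 threads.
examples = [{"text": f"comment {i}", "thread_id": f"t{i % 10}"} for i in range(50)]
train, evaluation = split_by_group(examples, "thread_id")

train_threads = {eg["thread_id"] for eg in train}
eval_threads = {eg["thread_id"] for eg in evaluation}
print(len(train), len(evaluation), train_threads & eval_threads)
```

The same idea works with a document ID as the group key if you’re annotating sentences.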

Overall I think the emphasis on development vs test data isn’t hugely relevant for most practitioners. It’s something that matters a lot when you have a whole community of researchers publishing on the same data. The test data lets us compare results from two researchers, without the confounder of how much random search they ran. For a single experimenter, performance on your development and test sets should be very well correlated, if your evaluation data is large enough. If it’s not large enough, you’d usually rather merge the datasets, and just have one evaluation. What you’re trying to do is form reliable conclusions about which of your choices is making the model better.
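One rough way to check whether your evaluation data is "large enough" for a given comparison is to put a confidence interval around each accuracy. This is a minimal sketch using the normal approximation to the binomial, which is my choice here and not something the tool computes for you. The function names are made up, and the accuracies are the illustrative 60/80 and 70/73 figures from above.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width


def intervals_overlap(acc_a, acc_b, n):
    """True if the two intervals overlap, i.e. the gap may just be noise."""
    low_a, high_a = accuracy_interval(acc_a, n)
    low_b, high_b = accuracy_interval(acc_b, n)
    return high_a >= low_b and high_b >= low_a


for n in (300, 3000, 10000):
    print(
        n,
        "60% vs 80% overlap:", intervals_overlap(0.60, 0.80, n),
        "| 70% vs 73% overlap:", intervals_overlap(0.70, 0.73, n),
    )
```

Overlapping intervals are a conservative check. If both models are scored on the same examples, a paired comparison would pick up smaller differences, but this is usually enough to tell you whether it’s time to annotate more.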

You might find Andrew Ng’s recent book Machine Learning Yearning useful. It covers these topics very well, in a short and readable form.