This is definitely a good question, and it really depends on the data and the label distribution: if you have lots of labels, including some that are rare, you usually want a larger evaluation set to make sure all labels are covered. If the set is too small, your results also become harder to interpret: if you're only evaluating on a small number of examples, even one or two individual predictions can easily account for a few percentage points of difference in accuracy.
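To make the point about small evaluation sets concrete, here is a minimal sketch of the arithmetic. The helper names `accuracy_step` and `missing_labels` are made up for illustration and are not part of any library; the label names in the usage line are placeholders.

```python
from collections import Counter


def accuracy_step(n_eval: int) -> float:
    """Change in accuracy (in percentage points) when one prediction flips."""
    return 100.0 / n_eval


def missing_labels(eval_labels, all_labels):
    """Labels that never occur in the evaluation set (hypothetical helper)."""
    seen = Counter(eval_labels)
    return [label for label in all_labels if seen[label] == 0]


for n in (50, 300, 1000):
    print(f"{n} eval examples: one prediction = {accuracy_step(n):.2f} points")

# Placeholder labels: "RARE" never shows up in this tiny evaluation set.
print(missing_labels(["A", "B", "A"], ["A", "B", "RARE"]))
```

With 50 evaluation examples, a single prediction is worth 2 points of accuracy, so a difference of a few points between two runs can come down to one or two examples.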
In the beginning, aiming for an evaluation set of about the same size as your training set might be a good approach. So you could train on 300 examples and evaluate on 300. Once you're satisfied with your evaluation set, you can keep it stable and train on 800, 1000, 1200 examples, always evaluating on the same 300 examples.
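As a rough sketch of that workflow, assuming your annotated examples live in a plain Python list: set the evaluation examples aside once, then grow only the training set. The function `fixed_eval_split` and the stand-in data are hypothetical, and the actual training and evaluation calls are left as a comment because they depend on your tooling.

```python
import random


def fixed_eval_split(examples, n_eval, seed=0):
    """Shuffle once with a fixed seed and set aside n_eval examples for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_eval], shuffled[n_eval:]


examples = list(range(1500))  # stand-in for 1500 annotated examples
eval_set, train_pool = fixed_eval_split(examples, n_eval=300)

for n_train in (300, 800, 1000, 1200):
    train_set = train_pool[:n_train]
    # The evaluation examples must never leak into the training data.
    assert not set(train_set) & set(eval_set)
    # Train on train_set and evaluate on eval_set here (tool-specific).
    print(f"train on {len(train_set)}, evaluate on {len(eval_set)}")
```

Because the evaluation set never changes, the accuracy numbers from the different training sizes stay directly comparable.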