I have a large corpus of financial documents that I plan to use for word vectors and transfer learning. Some of those documents are also in the training and evaluation sets for text classification. Should I keep those documents separate or can I use them for both models? Are there any pros/cons either way?
It’s okay if the word vectors and pretraining texts overlap with your training texts, but I would keep the evaluation texts separate and avoid training word vectors on them. In fact, if your training texts and your word vector texts overlap, the model might be more accurate on your training documents, which could even be helpful if you’re using a model to assist with the annotations. I don’t think it’ll matter much either way, though.
Even the overlap with the evaluation data probably won’t matter much. But if you don’t keep your evaluation texts separate, you’ll always have this nagging doubt: what if it did matter, and my evaluation numbers are a bit optimistic because of this effect? The point of the evaluation is to give you information, and the information is clearer if there are fewer of these hard-to-reason-about potential interactions.
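If you want to enforce that separation mechanically, here is a minimal sketch of one way to do it, assuming your documents are available as plain strings. The helper names `text_key` and `exclude_eval_texts` and the sample texts are made up for illustration and aren't part of any library. It only catches exact matches after lowercasing and whitespace normalization, so near-duplicates (for example, a revised filing) would still slip through.

```python
import hashlib
from typing import Iterable, Iterator, Set


def text_key(text: str) -> str:
    """Hash a lowercased, whitespace-normalized copy of the text.

    Hypothetical helper: the normalization is an assumption, adjust it to your data.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def exclude_eval_texts(
    corpus: Iterable[str], eval_texts: Iterable[str]
) -> Iterator[str]:
    """Yield every corpus text that does not also appear in the evaluation set."""
    eval_keys: Set[str] = {text_key(t) for t in eval_texts}
    for text in corpus:
        if text_key(text) not in eval_keys:
            yield text


# Made-up example data.
corpus = [
    "Q3 revenue rose 4%.",             # also a training text: fine to keep
    "The board approved a dividend.",
    "Net  income fell  in Q2.",        # also an evaluation text: dropped
]
eval_texts = ["net income fell in q2."]

# Use this filtered list for word vectors and pretraining.
vector_corpus = list(exclude_eval_texts(corpus, eval_texts))
print(vector_corpus)
# ['Q3 revenue rose 4%.', 'The board approved a dividend.']
```

Training texts are deliberately left in, since that overlap is fine. Only the evaluation texts are filtered out before you train the vectors or run pretraining.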