Can a word vectors corpus overlap with text classification training examples?

rpedela · April 2, 2019, 4:48pm

I have a large corpus of financial documents that I plan to use for word vectors and transfer learning. Some of those documents are also in the training and evaluation sets for text classification. Should I keep those documents separate or can I use them for both models? Are there any pros/cons either way?

honnibal · April 3, 2019, 12:37pm

It’s okay if the word vectors and pretraining texts overlap with your training texts, but I would keep the evaluation texts separate and avoid training word vectors on them. Actually, if your training texts and your word vector texts overlap, you might get better accuracy on your training documents if you’re using a model to assist with the annotations, so the overlap might even be helpful. I don’t think it’ll matter much either way, though.

Even the overlap with the evaluation data probably won’t matter much. But if you don’t keep your evaluation texts separate, you’ll always have this nagging doubt: what if it did matter, and my evaluation numbers are a bit optimistic because of this effect? The point of the evaluation is to give you information, and the information is clearer if there are fewer of these hard-to-reason-about potential interactions.
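One way to enforce that separation is to filter the evaluation texts out of the corpus before training word vectors on it. The following is a minimal sketch in plain Python, not taken from any library: `text_key` and `build_vectors_corpus` are hypothetical helper names, the sample documents are made up, and it assumes documents can be matched by their whitespace-normalized, lowercased text.

```python
import hashlib


def text_key(text: str) -> str:
    """Hash the lowercased, whitespace-normalized text so that trivial
    formatting differences do not hide a duplicate."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def build_vectors_corpus(all_texts, eval_texts):
    """Return the texts to use for word vectors / pretraining:
    everything except documents that also appear in the evaluation set.
    Training texts are deliberately left in."""
    eval_keys = {text_key(t) for t in eval_texts}
    return [t for t in all_texts if text_key(t) not in eval_keys]


# Made-up example data (assumption, for illustration only).
all_docs = [
    "Q1 revenue rose 5%.",
    "The bond matures in 2030.",
    "Dividend was  cut  to zero.",  # same document as in eval_docs, extra spaces
    "Shares fell after earnings.",
]
train_docs = ["Q1 revenue rose 5%."]
eval_docs = ["Dividend was cut to zero."]

vectors_corpus = build_vectors_corpus(all_docs, eval_docs)
```

Here the training document stays in `vectors_corpus`, while the evaluation document is dropped even though its spacing differs. Exact-text matching will not catch near-duplicates such as revised versions of the same filing, so treat it as a first pass.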

rpedela · April 3, 2019, 3:37pm

Thanks Matthew
