Do the outputted models using textcat.batch-train make use of word vectors?

bigbeaker · March 28, 2019, 10:03am

Hi guys

Quick question on the models output by Prodigy: do they use word vectors?
The model I trained and exported doesn't seem to have any knowledge of word vectors, and it's quite small (~10 MB).

If not, how can I use the word-vectors I have for pretraining?

Thanks!

ines · March 28, 2019, 10:29am

If word vectors are present in the model you're updating then yes, spaCy will use those representations during training. This can sometimes give you a nice boost in accuracy.

If your model is only 10 MB, it's likely that you started off with an sm model, which doesn't include word vectors. So instead, try using a model like en_core_web_md or en_core_web_lg.
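One quick way to check whether a given model ships with word vectors is to look at the shape of its vectors table. This is only a minimal sketch: `vector_summary` and `check_model` are hypothetical helper names, and it assumes the model is installed and exposes `nlp.vocab.vectors.shape` as spaCy pipelines normally do.

```python
def vector_summary(nlp):
    """Return (n_rows, n_dims, has_vectors) for a loaded spaCy pipeline.

    Reads the shape of the vectors table stored on the shared vocab.
    """
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows, n_dims, n_rows > 0 and n_dims > 0


def check_model(name):
    """Load an installed spaCy model by name and summarise its vectors.

    Hypothetical convenience wrapper; `name` could be something like
    "en_core_web_sm" or "en_core_web_md", assuming that package is installed.
    """
    import spacy  # imported lazily so vector_summary has no hard dependency

    return vector_summary(spacy.load(name))
```

Calling `check_model` on the base model you trained from, and on the directory you exported, should tell you whether vectors were available during training. An sm model would be expected to report no vector rows, while md and lg models should report a non-empty table.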

Actual pre-training is different – that's something we just introduced in spaCy v2.1. Here, you're pre-training weights using lots of raw unlabelled text and word vectors. Also see this blog post for examples. Once you have that artifact, you can pass it in when you train your model. In the next version of Prodigy, which will introduce support for spaCy v2.1, you'll also be able to pass in those pre-trained weights files in the textcat.teach and ner.teach recipes.
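As a rough sketch of the preparation step for pre-training: to my understanding, `spacy pretrain` in v2.1 reads raw text from a JSONL file with one object per line and a `"text"` key. The helper below writes that format. The function name and the file names in the comments are made up for illustration, and the commands shown in the comments are worth double-checking against `python -m spacy pretrain --help` for your installed version.

```python
import json
from pathlib import Path


def write_pretrain_texts(texts, path):
    """Write raw texts as JSONL, one {"text": ...} object per line.

    Empty or whitespace-only texts are skipped. Returns the number of
    lines written.
    """
    path = Path(path)
    n_written = 0
    with path.open("w", encoding="utf8") as f:
        for text in texts:
            text = text.strip()
            if not text:
                continue
            f.write(json.dumps({"text": text}) + "\n")
            n_written += 1
    return n_written


# Assumed follow-up steps on the command line (hypothetical paths):
#   python -m spacy pretrain raw_texts.jsonl en_core_web_lg ./pretrained
#   python -m spacy train ... --init-tok2vec ./pretrained/<weights file>
```

The vectors model passed to the pretrain command should be one that actually contains word vectors, such as an md or lg model.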

bigbeaker · March 28, 2019, 12:28pm

ah got it! makes sense

Thanks ines
