Loading gensim word2vec vectors for terms.teach?

You should definitely be able to load your pre-trained vectors. I’m not sure the code in that StackOverflow thread applies to the current version of spaCy.

Fundamentally, you can always add vectors to spaCy as follows. Say you have a list of word strings and a parallel sequence of vectors. You can do:

# shape is (number of entries, vector width)
nlp.vocab.reset_vectors(shape=(len(word_strings), len(vectors[0])))
for i, string in enumerate(word_strings):
    nlp.vocab.set_vector(string, vectors[i])

This might be slow for a large number of vectors, but you should only have to do it once. After loading in your vectors, save out the nlp object with nlp.to_disk(). You can then pass that directory to Prodigy as the model argument.
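Here’s a minimal sketch of the full round trip with gensim’s KeyedVectors. The file paths and output directory are placeholders, and index2word assumes gensim 3.x (it’s index_to_key in gensim 4+):

import spacy
from gensim.models import KeyedVectors

# Placeholder path: your word2vec file (binary or text format)
kv = KeyedVectors.load_word2vec_format("my_vectors.bin", binary=True)
nlp = spacy.load("en_core_web_sm")
nlp.vocab.reset_vectors(shape=(len(kv.index2word), kv.vector_size))
for word in kv.index2word:
    nlp.vocab.set_vector(word, kv[word])
nlp.to_disk("./model_with_vectors")  # placeholder output directory

You can then point terms.teach at the saved directory (the dataset name and seed terms here are placeholders too):

prodigy terms.teach my_dataset ./model_with_vectors --seeds "word1,word2"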

If you’re using your own pre-trained vectors, take care not to use the md or lg spaCy data packs. Those models use the pre-trained GloVe vectors as features, so if you swap in your own vectors, the activations will be different from what the model expects and you’ll get terrible results. The sm model doesn’t use pre-trained vectors, precisely to make it easy to swap in your own.
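If in doubt, you can check whether a loaded model ships with a vectors table. A minimal sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
# An empty table, e.g. (0, 0), means no pre-trained vectors;
# md/lg models should report a large table here instead
print(nlp.vocab.vectors.shape)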

You might also be interested in the terms.train-vectors recipe. It uses Gensim to train vectors on a text corpus and saves out the model for use with spaCy. It should serve as a working example of how that’s done.
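Usage is roughly as follows; the output directory and corpus path are placeholders, and the exact arguments may differ by Prodigy version, so check prodigy terms.train-vectors --help:

prodigy terms.train-vectors ./vectors_model /path/to/corpus.jsonl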
