You should definitely be able to load your pre-trained vectors. I’m not sure the code in that StackOverflow thread refers to the current version.
Fundamentally you can always add vectors to spaCy as follows. Let’s say you have a list of word strings, and some sequence of vectors. You can do:
# shape is (number of vectors, vector width)
nlp.vocab.reset_vectors(shape=shape)
for i, string in enumerate(word_strings):
    nlp.vocab.set_vector(string, vectors[i])
This might be slow for a large number of vectors, but you should only have to do it this way once. After loading in your vectors, you can save out the nlp object with nlp.to_disk(). Then you can pass that directory to Prodigy.
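The loop above assumes word_strings, vectors and shape already exist. In case it helps, here is a minimal sketch of one way to build them from a plain-text vectors file, where each line is a word followed by its values separated by spaces (the layout used by GloVe and word2vec text exports). The function name read_text_vectors and the sample data are made up for illustration, and the rows come back as plain lists of floats, so you may want to convert them to numpy arrays before handing them to set_vector.

```python
import io


def read_text_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into word_strings, vectors and shape."""
    word_strings = []
    vectors = []
    width = None
    for line_number, line in enumerate(lines, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # blank or malformed line
        # word2vec text exports start with a "<rows> <width>" header line;
        # assume a first line made of exactly two integers is that header.
        if line_number == 1 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        values = [float(v) for v in parts[1:]]
        if width is None:
            width = len(values)
        elif len(values) != width:
            raise ValueError(
                "line %d has %d values, expected %d" % (line_number, len(values), width)
            )
        word_strings.append(parts[0])
        vectors.append(values)
    # (number of vectors, vector width), as expected by reset_vectors(shape=...)
    shape = (len(word_strings), width or 0)
    return word_strings, vectors, shape


# Hypothetical sample data; in practice you would pass an open file instead.
sample = io.StringIO("king 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n")
word_strings, vectors, shape = read_text_vectors(sample)
```

With a real file you would call it as read_text_vectors(open(path, encoding="utf8")) and then run the reset_vectors / set_vector loop on the result.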
If you’re using your own pre-trained vectors, take care not to use the md or lg spaCy data packs. These models use the pre-trained GloVe vectors as features. If you swap in your own vectors, the activations will be different from what the model expects, and you’ll get terrible results. The sm model doesn’t use pre-trained vectors, to make it easy to swap in your own.
You might also be interested in the terms.train-vectors recipe. This uses Gensim to train on a text corpus, and saves out the model for use with spaCy. It should serve as a working example of how that’s done.