We have good word2vec models trained on biomedical text using gensim. Can we load them into spaCy like any other model, by pointing to a directory?
Yes, you'll be able to load these vectors with spaCy and use them with Prodigy. You can either load the vectors in a custom recipe, or create a script that loads the vectors into a spaCy model and then saves the model to a directory with nlp.to_disk(). Once the model has been saved to a directory, the vectors will be there, ready to use.
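A minimal sketch of such a script, assuming gensim's plain-text vector format (the paths, the model directory name and the toy vectors below are all illustrative, not from the original thread):

```python
# Sketch: load word2vec vectors in gensim's text format into a blank
# spaCy model and save the model to a directory with nlp.to_disk().
import numpy
import spacy

# Tiny demo file in gensim's text format: a "<n_words> <n_dims>" header,
# then one word per line followed by its vector components.
with open("vectors.txt", "w", encoding="utf8") as f:
    f.write("2 3\n")
    f.write("protein 0.1 0.2 0.3\n")
    f.write("genome 0.4 0.5 0.6\n")

nlp = spacy.blank("en")  # blank model, so no components depend on other vectors
with open("vectors.txt", encoding="utf8") as f:
    f.readline()  # skip the header line
    for line in f:
        pieces = line.rstrip().split(" ")
        word = pieces[0]
        vector = numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)

nlp.to_disk("biomedical_model")  # reload later with spacy.load("biomedical_model")
```

After this, spacy.load("biomedical_model") should give you a model whose vocab carries your vectors.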
The only thing to keep in mind: if you're loading in your own vectors, you shouldn't base your model on en_core_web_lg, because that model uses the pre-trained vectors as features in the tagger, parser and NER models. This means that if you replace the built-in vectors with other vectors in those models, you'll mess up the predictions.
I first converted the word2vec file to text format using gensim, like below (the added import and output path are my reading of the conversion step):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.save_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.txt', binary=False)  # output path illustrative
I have used the following script to save the vectors to disk, and used "en" as the language. Does that sound right?
When saving to disk, I didn't see a need to use en_core_web_sm — is that OK?
Have a look here: Loading gensim word2vec vectors for terms.teach?
Same use case.