Biomedical NLP models in spaCy

http://evexdb.org/pmresources/vec-space-models/

We have some great word2vec models trained on biomedical text with gensim. Can we load them in spaCy like any other model, by pointing to a directory?

Yes, you’ll be able to load these vectors with spaCy and use them with Prodigy. You can either load the vectors in a custom recipe, or create a script that loads the vectors into a spaCy model and then saves the model to a directory with nlp.to_disk(). Once the model has been saved to a directory, the vectors should be there, ready to use.
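As a rough illustration of the second option, here is a minimal sketch that reads vectors in the word2vec text format (a header line with the row count and dimension, then one word and its values per line) into a blank spaCy model and saves it. The function name build_model_with_vectors and both paths are hypothetical, and the sketch assumes the file is UTF-8 encoded and small enough to add word by word.

```python
from pathlib import Path

import numpy
import spacy


def build_model_with_vectors(vectors_path, output_dir, lang="en"):
    """Load word2vec text-format vectors into a blank model and save it.

    Hypothetical helper for illustration; not part of spaCy or Prodigy.
    """
    nlp = spacy.blank(lang)
    with Path(vectors_path).open("r", encoding="utf8") as file_:
        # Header line of the word2vec text format: "<n_rows> <n_dims>"
        n_rows, n_dims = (int(value) for value in file_.readline().split())
        nlp.vocab.reset_vectors(width=n_dims)
        for line in file_:
            # Split from the right so a token containing spaces stays intact
            pieces = line.rstrip().rsplit(" ", n_dims)
            if len(pieces) != n_dims + 1:
                continue  # skip malformed rows
            word = pieces[0]
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f")
            nlp.vocab.set_vector(word, vector)
    # The saved directory can later be loaded with spacy.load(output_dir)
    nlp.to_disk(output_dir)
    return nlp
```

Calling build_model_with_vectors("vectors.txt", "my_vectors_model") should leave a directory that can be passed to Prodigy wherever a spaCy model name or path is expected.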

The only thing to keep in mind is, if you’re loading in your own vectors, you should base your model on en_core_web_sm. Both en_core_web_md and en_core_web_lg use pre-trained vectors as features in the tagger, parser and NER models. This means that if you replace the built-in vectors with other vectors in those models, you’ll mess up the predictions.

I first converted the word2vec file to txt using gensim like below:

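from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
# load_word2vec_format already returns KeyedVectors, so save directly (no .wv); the default output is text
model.save_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.txt')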

I then used the following script to save the vectors to disk, with the language set to “en”. Does that sound right?

As for en_core_web_sm, I did not see a need for it when saving to disk. Is that OK?

Have a look here: Loading gensim word2vec vectors for terms.teach?

Same use case.