We have good word2vec models trained on biomedical text using gensim. Can we load them into spaCy like any other model, by pointing to a directory?
Yes, you'll be able to load these vectors with spaCy and use them with Prodigy. You can either load the vectors in a custom recipe, or create a script that loads the vectors into a spaCy model and then saves the model to a directory with nlp.to_disk(). Once the model has been saved to a directory, the vectors will be there, ready to use.
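A minimal sketch of such a script, assuming gensim's plain-text vector format (the paths, the model directory name and the toy vectors below are all illustrative, not from the original thread):

```python
# Sketch: load word2vec vectors in gensim's text format into a blank
# spaCy model and save the model to a directory with nlp.to_disk().
import numpy
import spacy

# Tiny demo file in gensim's text format: a "<n_words> <n_dims>" header,
# then one word per line followed by its vector components.
with open("vectors.txt", "w", encoding="utf8") as f:
    f.write("2 3\n")
    f.write("protein 0.1 0.2 0.3\n")
    f.write("genome 0.4 0.5 0.6\n")

nlp = spacy.blank("en")  # blank model, so no components depend on other vectors
with open("vectors.txt", encoding="utf8") as f:
    f.readline()  # skip the header line
    for line in f:
        pieces = line.rstrip().split(" ")
        word = pieces[0]
        vector = numpy.asarray(pieces[1:], dtype="float32")
        nlp.vocab.set_vector(word, vector)

nlp.to_disk("biomedical_model")  # reload later with spacy.load("biomedical_model")
```

After this, spacy.load("biomedical_model") should give you a model whose vocab carries your vectors.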
The only thing to keep in mind: if you're loading in your own vectors, you shouldn't base your model on en_core_web_lg, because that model uses the pre-trained vectors as features in the tagger, parser and NER models. This means that if you replace the built-in vectors with other vectors in those models, you'll mess up the predictions.
I first converted the word2vec file to text format using gensim, like below (the added import and output path are my reading of the conversion step):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.bin', binary=True)
model.save_word2vec_format('/Users/philips/Downloads/wikipedia-pubmed-and-PMC-w2v.txt', binary=False)  # output path illustrative
I have used the following script to save the vectors to disk, and used "en" as the language. Does that sound right?
When saving to disk, I didn't see a need to use en_core_web_sm — is that OK?
Have a look here: Loading gensim word2vec vectors for terms.teach?
Same use case.