PubMed word vectors

beperron · September 7, 2021, 1:10am

I am using Prodigy to build a model that classifies the scholarly literature on COVID. My initial task is quite straightforward -- to separate empirical publications (i.e., actual science) from non-empirical publications (e.g., personal essays, position papers, critiques). My collection is around 275k, so this is definitely an NLP task. I'm really impressed with the results so far -- and, of course, the software!

Here is my question: I'm using the en_core_web_lg word vectors in the model. I would actually like to try using the word vectors trained on the PubMed research, which I can obtain here:

https://bio.nlplab.org/

Here is the file manifest:

I'm interested in trying any of the .bin models, but I am not having any luck getting them into my model. Is there any special configuration I need to do after obtaining these files? These are Word2Vec models. Is there some documentation I missed?

Thanks,
Brian

ines · September 7, 2021, 2:47am

Hi and thanks! If the vectors are word2vec vectors, you can import them into spaCy using the init vectors command, which outputs a loadable spaCy model that you can use instead of en_core_web_lg. See here for details: https://spacy.io/usage/linguistic-features#adding-vectors
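One note on the .bin files: as far as I know, init vectors expects the vectors as a text file whose first row holds the dimensions, followed by a space-separated word2vec table, so a binary file usually has to be converted first. Here is a minimal sketch of such a conversion, assuming the standard word2vec binary layout (a text header with vocabulary size and dimension, then each word followed by a space and little-endian float32 values). The function name word2vec_bin_to_text and the file names are made up for illustration. gensim's KeyedVectors.load_word2vec_format(..., binary=True) followed by save_word2vec_format(..., binary=False) should do the same job.

```python
import struct


def word2vec_bin_to_text(bin_path, txt_path):
    """Convert a word2vec binary file to the word2vec text format.

    Assumes the standard layout: a header line "<n_words> <dim>", then for
    each entry the word, a single space, and <dim> little-endian float32
    values, optionally followed by a newline.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf8") as dst:
        header = src.readline().split()
        n_words, dim = int(header[0]), int(header[1])
        dst.write(f"{n_words} {dim}\n")
        for _ in range(n_words):
            # Read the word one byte at a time until the separating space.
            raw = bytearray()
            while True:
                ch = src.read(1)
                if not ch:
                    raise EOFError("File ended before all vectors were read")
                if ch == b" ":
                    break
                if ch != b"\n":  # skip the newline left over from the previous entry
                    raw.extend(ch)
            word = raw.decode("utf8", errors="replace")
            # Read the vector itself: dim float32 values, 4 bytes each.
            values = struct.unpack(f"<{dim}f", src.read(4 * dim))
            dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
    return n_words, dim


# Hypothetical usage (file names are placeholders):
# word2vec_bin_to_text("pubmed_vectors.bin", "pubmed_vectors.txt")
```

The resulting text file can then be passed to spaCy, for example with python -m spacy init vectors en pubmed_vectors.txt ./pubmed_vectors_model, and the output directory used in place of en_core_web_lg.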

Btw, just in case you haven't seen it, you might also find some of the scispaCy pipelines useful: scispacy (SpaCy models for biomedical text processing). If I read the docs correctly, the included word vectors may even be the same vectors you want to use!

beperron · September 7, 2021, 2:22pm

Thanks so much! Is there a straightforward way to use the scispaCy models with Prodigy? I installed the package and immediately got Traceback errors. Upon installation, scispaCy detected and replaced spaCy with an incompatible version. So then I downloaded the model directly, unzipped it, and called the model from the path, as such:

OSError: [E053] Could not read config.cfg from en_core_sci_scibert-0.4.0/config.cfg

My directory is set up as follows:
[Screenshot: directory layout]

Do you have a short answer on whether there is a quick and easy way to use the scispaCy models within Prodigy? I am a Python newbie, but I have very good support at my university and can work with them if there is a solution.

Thanks again!
Brian

ines · September 8, 2021, 3:49am

I think the problem is that you're trying to load the top-level package directory. If you've unzipped the model, you want to be loading the model data directory inside it (the one that includes the config.cfg).
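If it helps, here is a rough sketch of how you could locate that directory programmatically. The helper names find_model_data_dir and load_unzipped_pipeline are hypothetical, and the nested layout mentioned in the comments is only what I would typically expect from an unzipped model package, so check it against your own folder.

```python
from pathlib import Path


def find_model_data_dir(unzipped_root):
    """Return the directory under `unzipped_root` that contains config.cfg.

    If several config.cfg files exist, the first one in sorted path order wins.
    """
    matches = sorted(Path(unzipped_root).rglob("config.cfg"))
    if not matches:
        raise FileNotFoundError(f"No config.cfg found under {unzipped_root}")
    return matches[0].parent


def load_unzipped_pipeline(unzipped_root):
    """Load a spaCy pipeline from the data directory inside an unzipped package."""
    import spacy  # imported here so the path helper works without spaCy installed

    return spacy.load(find_model_data_dir(unzipped_root))


# Hypothetical usage: the data directory is usually nested a couple of levels
# down, e.g. <package>/<name>/<name>-<version>/config.cfg (an assumption).
# nlp = load_unzipped_pipeline("en_core_sci_scibert-0.4.0")
```

This only works if the spaCy version in your environment matches the one the pipeline was built for.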

Btw, the pipelines shipped by scispaCy are Python packages that you can install via pip install, so you should be able to just run the following to install them in your environment (without unzipping!):

pip install en_core_sci_scibert-0.4.0.tar.gz

You can then load the pipeline in spaCy as nlp = spacy.load("en_core_sci_scibert").
