PubMed word vectors

I am using Prodigy to build a model that classifies the scholarly literature on COVID. My initial task is quite straightforward: to separate empirical publications (i.e., actual science) from non-empirical publications (e.g., personal essays, position papers, critiques). My collection is around 275k publications, so this is definitely an NLP task. I'm really impressed with the results so far -- and, of course, the software!

Here is my question: I'm using the en_core_web_lg word vectors in the model. I would actually like to try using the word vectors trained on the PubMed research, which I can obtain here:

Here is the file manifest:

I'm interested in trying any of the .bin models, but I'm not having any luck getting them into my model. Are there any special configuration steps I need to take after obtaining these files? These are Word2Vec models. Is there some documentation I missed?


Hi and thanks :blush: If the vectors are word2vec vectors, you can import them into spaCy using the init vectors command, which outputs a loadable spaCy model that you can use instead of en_core_web_lg. See here for details:

Btw, just in case you haven't seen it, you might also find some of the scispaCy pipelines (spaCy models for biomedical text processing) useful. If I read the docs correctly, the included word vectors may even be the same vectors you want to use!
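For the word2vec route, here is a minimal sketch of how the conversion could look. It assumes that init vectors expects the plain-text word2vec format rather than a .bin file, so the binary file is first converted with gensim (an extra dependency that is not mentioned in this thread). The file names and the output directory are hypothetical placeholders.

```python
import sys
from typing import List


def convert_word2vec_bin_to_text(bin_path: str, txt_path: str) -> None:
    """Convert a binary word2vec file to the plain-text word2vec format."""
    # Assumption: gensim is installed (pip install gensim). It is imported
    # lazily so the rest of this sketch works without it.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(bin_path, binary=True)
    vectors.save_word2vec_format(txt_path, binary=False)


def build_init_vectors_command(lang: str, vectors_path: str, output_dir: str) -> List[str]:
    """Build the `spacy init vectors` call as an argument list for subprocess.run."""
    return [sys.executable, "-m", "spacy", "init", "vectors", lang, vectors_path, output_dir]


# Example usage (hypothetical file names):
#   import subprocess
#   convert_word2vec_bin_to_text("pubmed_vectors.bin", "pubmed_vectors.txt")
#   subprocess.run(
#       build_init_vectors_command("en", "pubmed_vectors.txt", "./pubmed_vectors_model"),
#       check=True,
#   )
```

The output directory should then be loadable with spacy.load("./pubmed_vectors_model") and usable in place of en_core_web_lg.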

Thanks so much! Is there a straightforward way to use the scispaCy models with Prodigy? I installed the package and immediately got traceback errors: upon installation, scispaCy detected spaCy and replaced it with an incompatible version. So I then downloaded the model directly, unzipped it, and called the model from the path, as such:

OSError: [E053] Could not read config.cfg from en_core_sci_scibert-0.4.0/config.cfg

My directory is set up as follows:
[Screenshot: Screen Shot 2021-09-07 at 10.17.04 AM]

Do you have a short answer on whether there is a quick and easy way to use the scispaCy models within Prodigy? I am a Python newbie, but I have very good support at my university and can work with them if there is a solution.

Thanks again!

I think the problem is that you're trying to load the top-level package directory. If you've unzipped the model, you want to be loading the model data directory inside it (the one that includes the config.cfg).
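If it's not obvious which subdirectory that is, a small helper along these lines could locate it. This is only a sketch: find_model_data_dir is a hypothetical name, and the only thing it relies on is that the model data directory is the one containing config.cfg.

```python
from pathlib import Path
from typing import Optional


def find_model_data_dir(unzipped_root: str) -> Optional[Path]:
    """Return the directory under unzipped_root that contains config.cfg, if any."""
    root = Path(unzipped_root)
    # Prefer the shallowest match in case several config.cfg files exist.
    matches = sorted(root.rglob("config.cfg"), key=lambda p: len(p.parts))
    return matches[0].parent if matches else None


# Example usage (requires spaCy and the unzipped model on disk):
#   import spacy
#   data_dir = find_model_data_dir("en_core_sci_scibert-0.4.0")
#   nlp = spacy.load(data_dir)
```

Passing the returned path to spacy.load should avoid the E053 error, since that directory is the one with the config.cfg in it.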

Btw, the pipelines shipped by scispaCy are Python packages that you can install via pip install, so you should be able to just run the following to install them in your environment (without unzipping!):

pip install en_core_sci_scibert-0.4.0.tar.gz

You can then load the pipeline in spaCy as nlp = spacy.load("en_core_sci_scibert").
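As a quick sanity check, something like the following sketch could confirm that the package actually landed in the environment you're running from before trying to load it. The helper names are made up for illustration; the only library calls used are importlib.util.find_spec and spacy.load.

```python
import importlib.util


def is_pipeline_installed(package_name: str) -> bool:
    """Check whether a pipeline package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


def load_pipeline(package_name: str = "en_core_sci_scibert"):
    """Load an installed spaCy pipeline package, with a clearer error if it is missing."""
    if not is_pipeline_installed(package_name):
        raise RuntimeError(
            f"{package_name} is not installed in this environment; "
            "run pip install on the downloaded .tar.gz first."
        )
    # Imported lazily so the check above works even without spaCy installed.
    import spacy

    return spacy.load(package_name)
```

If is_pipeline_installed returns False right after installing, pip has probably installed the package into a different Python environment than the one running Prodigy.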