How do I work with available word vectors during NER training?


I want to train an NER model to automatically extract "nanoparticle" entities from biomedical text. Since it is a new entity type, I decided to train the model from scratch and use existing word vectors trained on a PubMed corpus (biomedical text) to help lift my model's performance.

This is probably the wrong way to do it, but I downloaded the PubMed-w2v.bin file, and followed the same command used for the food ingredients example:

python -m prodigy train ./model --ner dataset_name --base-model en_core_sci_md --paths.init-tok2vec ./PubMed-w2v.bin --eval-split 0.2

I can't figure out how to incorporate the available word vectors into my workflow. The above command gives horrible results, which I guess makes sense: a word2vec file presumably isn't the same thing as the pretrained tok2vec weights passed to --paths.init-tok2vec in the food ingredients example (?)

An easier way might be to initialize a new spaCy model with these vectors beforehand.

Have you seen the spacy init vectors command? It lets you create a new spaCy model on disk that carries your embeddings. This local model can then be referenced as a starting point via --base-model in the Prodigy train command.
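As a rough sketch of that two-step workflow, the snippet below only assembles the two command lines and runs them in order. The paths ./PubMed-w2v.txt and ./pubmed_vectors_model are hypothetical placeholders, the language code "en" is an assumption, and the vectors are assumed to already be in a text format that spacy init vectors can read.

```python
import subprocess
import sys


def init_vectors_cmd(vectors_path, model_dir, lang="en"):
    """Build the `spacy init vectors` call that writes a vectors-only pipeline to model_dir."""
    return [sys.executable, "-m", "spacy", "init", "vectors", lang, vectors_path, model_dir]


def prodigy_train_cmd(output_dir, dataset, base_model, eval_split=0.2):
    """Build the `prodigy train` call that starts from the vectors-only pipeline."""
    return [
        sys.executable, "-m", "prodigy", "train", output_dir,
        "--ner", dataset,
        "--base-model", base_model,
        "--eval-split", str(eval_split),
    ]


def run_workflow(vectors_path, model_dir, output_dir, dataset):
    """Run both steps in order; raises CalledProcessError if either command fails."""
    for cmd in (
        init_vectors_cmd(vectors_path, model_dir),
        prodigy_train_cmd(output_dir, dataset, base_model=model_dir),
    ):
        subprocess.run(cmd, check=True)


# Example call (hypothetical paths, dataset name taken from the question):
# run_workflow("./PubMed-w2v.txt", "./pubmed_vectors_model", "./model", "dataset_name")
```

Running the two commands by hand in a terminal works just as well; the functions only make the order and the arguments explicit.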

Let me know if this does not work for you, but this is how I usually use different vectors when I run benchmarks.

Hello Vincent,

Yes, I have checked out that command. I even tried it, but I got a bunch of errors. I guess that was because the w2v file I downloaded from this website is a .bin file, whereas the documentation on spaCy's website requires a .txt file or a zipped text file in .zip or .tar.gz format. The downloaded .bin file is 1.77 GB. Will I have to convert it and then check whether it works? I was hoping to find a way to use the .bin file directly with the init vectors command.
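If a conversion does turn out to be needed, here is a minimal sketch of one way to do it with only the standard library. It assumes the .bin file uses the original word2vec binary layout (an ASCII header line with vocabulary size and dimension, then each word followed by a space and little-endian float32 values); that assumption is not confirmed for this particular file. The function name convert_word2vec_bin_to_txt is made up for illustration.

```python
import struct


def convert_word2vec_bin_to_txt(bin_path, txt_path, encoding="utf-8"):
    """Convert a word2vec-style binary file to the plain-text word2vec format.

    Assumes the layout "<vocab_size> <dim>\\n" followed by, for each entry,
    the word bytes, a single space, and dim little-endian float32 values.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        header = src.readline().decode("ascii")
        vocab_size, dim = (int(part) for part in header.split())
        dst.write(f"{vocab_size} {dim}\n")
        unpack = struct.Struct(f"<{dim}f").unpack

        for _ in range(vocab_size):
            # Read the word one byte at a time until the separating space.
            word_bytes = bytearray()
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise ValueError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # drop the newline written after the previous vector
                    word_bytes.extend(ch)
            # Undecodable bytes are replaced rather than aborting the conversion.
            word = word_bytes.decode(encoding, errors="replace")

            raw = src.read(4 * dim)
            if len(raw) != 4 * dim:
                raise ValueError(f"truncated vector for word {word!r}")
            values = unpack(raw)
            dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")

    return vocab_size, dim
```

The resulting text file could then be passed to something like python -m spacy init vectors en ./PubMed-w2v.txt ./pubmed_vectors_model, where the output directory name is just a placeholder. Reading byte by byte keeps the sketch short but will be slow on a 1.77 GB file, and the text output will be considerably larger than the binary input.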

Do you happen to know how these vectors were trained? With Gensim? FastText? Are you aware of any documentation for these vectors?

I'm also wondering, did you try running the other models/vectors from scispacy?