How do I work with available word vectors during NER training?

nanyasrivastav · June 28, 2022, 2:31pm

Hello,

I want to train an NER model to automatically extract "nanoparticle" entities from biomedical text. Since it is a new entity type, I decided to train the model from scratch and use existing word vectors that were trained on the PubMed corpus (biomedical text) to help lift my model's performance.

This is probably the wrong way to do it, but I downloaded the PubMed-w2v.bin file, and followed the same command used for the food ingredients example:

python -m prodigy train ./model --ner dataset_name --base-model en_core_sci_md --paths.init-tok2vec ./PubMed-w2v.bin --eval-split 0.2

I am not able to figure out how to incorporate the available word vectors into my workflow. The above command gives horrible results, which I guess makes sense because the word2vec file isn't exactly the same as pretrained tok2vec weights used in the food ingredients example (?)

koaning · June 29, 2022, 10:50am

An easier way might be to initialize a new spaCy model with these vectors beforehand.

Have you seen the init vectors command? It lets you create a new spaCy model on disk that carries your embeddings. This local model can then be referenced as a starting point via --base-model in the Prodigy train command.

Let me know if this does not work for you, but this is how I usually use different vectors when I run benchmarks.

nanyasrivastav · June 29, 2022, 1:46pm

Hello Vincent,

Yes, I have checked out that command. I even tried it out, but I got a bunch of errors. I guess that was because the w2v file I downloaded from this website is a .bin file, while the documentation on spaCy's website requires a .txt file or a zipped text file in .zip or .tar.gz format. The downloaded .bin file is 1.77 GB. Will I have to convert it and then check whether it works? I was hoping to find a way to use the .bin file directly with the init vectors command.

koaning · June 30, 2022, 6:53am

Do you happen to know how these vectors were trained? With Gensim? FastText? Are you aware of any documentation for these vectors?

I'm also wondering, did you try running the other models/vectors from scispacy?
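
If the .bin file turns out to be in the standard word2vec C binary format (an assumption; the thread does not confirm how these vectors were produced), one way to get the text format that init vectors expects is a small conversion script along these lines. This is only a sketch: the function name convert_word2vec_bin_to_txt and the file paths are made up for illustration, and the assumed file layout is described in the docstring.

```python
import struct


def convert_word2vec_bin_to_txt(bin_path, txt_path, encoding="utf-8"):
    """Convert a word2vec C-format binary file to the plain-text format.

    Assumed layout (not confirmed for the PubMed file in this thread):
    an ASCII header line "<vocab_size> <dim>", then for each entry the
    word's bytes, a single space, and <dim> little-endian float32 values.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding=encoding) as dst:
        vocab_size, dim = (int(x) for x in src.readline().split())
        dst.write(f"{vocab_size} {dim}\n")
        fmt = f"<{dim}f"
        n_bytes = struct.calcsize(fmt)
        for _ in range(vocab_size):
            # Read the word byte by byte until the separating space.
            chars = bytearray()
            while True:
                ch = src.read(1)
                if ch == b"":
                    raise ValueError("unexpected end of file while reading a word")
                if ch == b" ":
                    break
                if ch != b"\n":  # some writers emit a newline after each vector
                    chars.extend(ch)
            word = chars.decode(encoding, errors="replace")
            # Read the raw float32 values for this word.
            buf = src.read(n_bytes)
            if len(buf) != n_bytes:
                raise ValueError(f"truncated vector for word {word!r}")
            values = struct.unpack(fmt, buf)
            dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
    return vocab_size, dim


# Hypothetical usage; the output path is an illustrative choice:
# convert_word2vec_bin_to_txt("./PubMed-w2v.bin", "./PubMed-w2v.txt")
```

The resulting text file could then be passed to python -m spacy init vectors en ./PubMed-w2v.txt ./pubmed_vectors_model (the output directory name is hypothetical), and that directory used as --base-model in the Prodigy train command. Expect the text file to be noticeably larger than the 1.77 GB binary. If gensim is installed, KeyedVectors.load_word2vec_format with binary=True followed by save_word2vec_format with binary=False should do the same job.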
