I have tried different things (and, for that matter, I tried the other PubMed dataset Terms Trains Crashing, but I could not make it work due to some weird encoding errors).
I am not sure how I can create a spaCy model with these vectors and then use it in Prodigy to train new entities…
I have tried many things; here is my latest attempt:
from gensim.models import KeyedVectors
import spacy
import numpy as np
word2vec = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
word2vec.wv.save_word2vec_format('./data/medical-w2v.bin', binary=True)
nlp = spacy.load("en_core_web_sm", vectors=False)
rows, cols = 0, 0
for i, line in enumerate(open('./data/medical-w2v.bin', 'r', encoding='latin1')):
    if i == 0:
        rows, cols = line.split()
        rows, cols = int(rows), int(cols)
        nlp.vocab.reset_vectors(shape=(rows, cols))
    else:
        word, *vec = line.split()
        vec = np.array([i for i in vec])
        # vec = np.array([float(i) for i in vec])
        nlp.vocab.set_vector(word, vec)
        print(word)
nlp.to_disk('spacy_word2vec')
I get a few errors:
If I do not specify the latin1 encoding, it crashes on unrecognised characters.
If I cast to float, I get errors because somehow there are strings: ValueError: could not convert string to float: '\x08òÞ:Räâ:×ÁÅ:nùé:U\x9dÁº\x8d\x7fî:¦&\x96º¸èJ:\x98ÍϺh\x80˺CƼ:v¥¨¹\x17cö:û\x9c\x13:ö\x15µ9*«\x11»lìÒ:\x94Z\x04:'
If I remove the cast to float, I get: ValueError: could not broadcast input array from shape (2) into shape (200)
Actually, I do not really understand what I am trying to achieve with this conversion, or how best to try those medical word vectors on my texts…
Thank you for your help, and sorry if I was not clear before, as it was late at night.
Hi! To answer your first question, creating a spaCy model with Word2Vec vectors should be as simple as this:
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
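For a fuller picture, here is a minimal sketch of the same loop applied to a binary vectors file such as the PubMed-w2v.bin from the question. It assumes gensim 4 or later, where the vocabulary is exposed as `key_to_index` and vectors are read with `kv[word]` (older versions use the `vocab` and `word_vec` names shown above). The helper names `copy_vectors` and `build_model` and the use of a blank English pipeline are illustrative choices, not something Prodigy or spaCy prescribes.

```python
def copy_vectors(keyed_vectors, vocab):
    """Copy every vector from a gensim-style KeyedVectors into a spaCy-style vocab.

    Assumes the gensim >= 4 interface: `key_to_index` holds the words and
    `keyed_vectors[word]` returns the vector. Returns the number of words copied.
    """
    count = 0
    for word in keyed_vectors.key_to_index:
        vocab.set_vector(word, keyed_vectors[word])
        count += 1
    return count


def build_model(w2v_path, output_dir):
    """Hypothetical helper: load binary Word2Vec vectors and save a spaCy model."""
    # Imported here so the loop above can be read (and tried) without either library.
    from gensim.models import KeyedVectors
    import spacy

    # Let gensim decode the binary file instead of reading it line by line as text.
    keyed_vectors = KeyedVectors.load_word2vec_format(w2v_path, binary=True)

    nlp = spacy.blank("en")  # assumption: start from an empty English pipeline
    # Resize the vectors table to the width of the loaded vectors.
    nlp.vocab.reset_vectors(width=keyed_vectors.vector_size)
    copy_vectors(keyed_vectors, nlp.vocab)
    nlp.to_disk(output_dir)
    return nlp
```

Calling `build_model('PubMed-w2v.bin', 'spacy_word2vec')` should then leave a directory that `spacy.load` can read, using the same file and directory names as in the question.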
(The source of the terms.train-vectors recipe is shipped with Prodigy, so you can also have a look at the code and see how the training plus creating a spaCy model works here).
The init-model command expects a tab-separated file in the Word2Vec format, where the first line is a string tuple of the shape. See my comment here and this post for more details on the format.
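To make that layout concrete, here is a small sketch that writes and reads the plain-text format using only the standard library. It also shows why the loop in the question fails: a file saved with binary=True stores the vectors as raw bytes, so splitting its lines as text yields garbage instead of numbers. The function names are made up for illustration, and the delimiter is left as a parameter because the exact separator the command expects is best checked against the spaCy documentation for your version.

```python
import io


def write_word2vec_text(vectors, stream, sep=" "):
    """Write {word: [floats]} in the plain-text Word2Vec layout."""
    if not vectors:
        raise ValueError("no vectors to write")
    width = len(next(iter(vectors.values())))
    # Header line: number of rows, then the width of each vector.
    stream.write(f"{len(vectors)}{sep}{width}\n")
    for word, values in vectors.items():
        if len(values) != width:
            raise ValueError(f"{word!r} has {len(values)} values, expected {width}")
        # One row per word: the word itself, then its values as text.
        stream.write(sep.join([word] + [repr(float(v)) for v in values]) + "\n")


def read_word2vec_text(stream):
    """Read the layout back, checking every row against the header."""
    rows, cols = (int(n) for n in stream.readline().split())
    vectors = {}
    for line in stream:
        word, *values = line.split()  # split() accepts spaces and tabs alike
        if len(values) != cols:
            raise ValueError(f"{word!r} has {len(values)} values, expected {cols}")
        vectors[word] = [float(v) for v in values]
    if len(vectors) != rows:
        raise ValueError(f"header promised {rows} rows, found {len(vectors)}")
    return vectors


buffer = io.StringIO()
write_word2vec_text({"aspirin": [0.5, -1.0], "dose": [2.0, 0.25]}, buffer)
print(buffer.getvalue())
```

If the vectors are already loaded in gensim, calling `save_word2vec_format(path, binary=False)` on them should produce a text file of this kind without any hand-written code.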