How to use two .txt files, one with vectors and the other with words

Hello,

I have found a biomedical data set trained on PubMed, which consists of two files, one with words and the other with vectors:

words:

allergic-asthma
allergic-asthmatic
allergic-hyperergic
allergic-induced
allergic-inflammation
allergic-inflammatory
allergic-like
allergic-type
...

vectors:

0.0528996 0.0873517 0.0526077 -0.207811 0.2837 0.214404 0.141968 -0.0336167 [truncated]
...

I have tried different things (and for that matter, I tried the other PubMed dataset, Terms Trains Crashing, but I could not make it work due to some weird encoding errors).

I am not sure how I can create a spaCy model with these vectors to then use it in Prodigy to train new entities…

I have tried many things; here are my last few attempts:

from gensim.models import KeyedVectors
import spacy
import numpy as np

word2vec = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)
word2vec.wv.save_word2vec_format('./data/medical-w2v.bin', binary=True)

nlp = spacy.load("en_core_web_sm", vectors=False)
rows, cols = 0, 0
for i, line in enumerate(open('./data/medical-w2v.bin', 'r', encoding='latin1')):
    if i == 0:
        rows, cols = line.split()
        rows, cols = int(rows), int(cols)
        nlp.vocab.reset_vectors(shape=(rows, cols))
    else:
        word, *vec = line.split()
        vec = np.array([i for i in vec])
        # vec = np.array([float(i) for i in vec])
        nlp.vocab.set_vector(word, vec)
        print(word)

nlp.to_disk('spacy_word2vec')

I get a few errors:

  • if I do not specify the latin1 encoding, it crashes on unrecognised characters,
  • if I cast to float, I get errors because somehow there are strings:
    ValueError: could not convert string to float: '\x08òÞ:Räâ:×ÁÅ:nùé:U\x9dÁº\x8d\x7fî:¦&\x96º¸èJ:\x98ÍϺh\x80˺CƼ:v¥¨¹\x17cö:û\x9c\x13:ö\x15µ9*«\x11»lìÒ:\x94Z\x04:'
  • if I remove the cast to float, I get:
    ValueError: could not broadcast input array from shape (2) into shape (200)

Actually, I do not really understand what I am trying to achieve with this conversion, or how best to use those medical word vectors on my texts…

Thank you for your help, and sorry if I was not clear before, as it was late at night :wink:

One more question,

For testing purposes I have combined a few lines of both text files:

<word><space><vector>

but doing

spacy init-model en /data --vectors-loc ./word2vecTools/test.txt

I get:

ValueError: invalid literal for int() with base 10: '#'

The thing is that the first “word” in the list is #

Why does spaCy try to cast it to an int? Do I need to add a key?

<id><space><word><space><vector>

(it does not seem to work; I have tried, and weirdly I get the same error…)

Hi! To answer your first question, creating a spaCy model with Word2Vec vectors should be as simple as this:

for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))

(The source of the terms.train-vectors recipe is shipped with Prodigy, so you can also have a look at the code and see how the training plus creating a spaCy model works here).
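To make that concrete, here is a rough end-to-end sketch along those lines. The PubMed-w2v.bin filename and the spacy_word2vec output directory are taken from your snippet; the rest assumes the vectors load cleanly with gensim, and depending on your gensim version the attribute names may differ slightly:

from gensim.models import KeyedVectors
import spacy

# Load the pre-trained PubMed vectors (binary Word2Vec format)
w2v = KeyedVectors.load_word2vec_format('PubMed-w2v.bin', binary=True)

# Start from a blank English pipeline so the vocab only receives the new vectors
nlp = spacy.blank('en')

# Copy every vector into spaCy's vocab
# (on newer gensim versions this may be w2v.key_to_index / w2v.get_vector(word))
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))

# Save the model so it can be loaded with spacy.load() or passed to Prodigy
nlp.to_disk('spacy_word2vec')

You should then be able to load the result with spacy.load('spacy_word2vec') and pass that path as the model argument to your Prodigy recipes.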

The init-model command expects a whitespace-separated file in the Word2Vec text format, where the first line gives the shape (the number of vectors and the number of dimensions). See my comment here and this post for more details on the format.
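That missing header line is presumably why you see the int() error: the command reads the first line as the shape, and your combined file starts straight with the word #. One way to build a file in the expected format from your two .txt files is to zip them together and write the shape header first. This is just a sketch, assuming the two files line up one-to-one; the filenames are placeholders:

# Merge words.txt (one word per line) and vectors.txt (one row of floats per line)
# into a single Word2Vec-style text file for `spacy init-model`.
with open('words.txt', encoding='utf8') as f:
    words = [line.strip() for line in f if line.strip()]
with open('vectors.txt', encoding='utf8') as f:
    vectors = [line.strip() for line in f if line.strip()]

assert len(words) == len(vectors), "words and vectors must line up one-to-one"
n_dims = len(vectors[0].split())

with open('medical_vectors.txt', 'w', encoding='utf8') as out:
    # First line: number of vectors and number of dimensions
    out.write('{} {}\n'.format(len(words), n_dims))
    for word, vec in zip(words, vectors):
        out.write('{} {}\n'.format(word, vec))

After that, spacy init-model en /data --vectors-loc medical_vectors.txt should read the shape from the header instead of trying to cast # to an integer, so no extra key column is needed.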

It really was that simple… Thank you Ines!
