I have a set of pre-trained word vectors I created with gensim’s word2vec that I’d like to use with the terms.teach recipe. These vectors are very domain-specific, which is why I’d like to use them instead of pretrained embeddings. I’ve also trained them on a pretty large corpus, so I’d like to reuse them rather than start from scratch if I can. The documentation says I’ll need to convert my gensim model to spaCy’s format to use it. Based on some googling, it seems like I’ll need to follow the instructions from this StackOverflow answer, followed by the modifications from this GitHub issue?
Should this work or am I better off starting from scratch building new embeddings (maybe with fastText vectors?)
You should definitely be able to load your pre-trained vectors. I’m not sure the code in that StackOverflow thread reflects the current version, though.
Fundamentally you can always add vectors to spaCy as follows. Let’s say you have a list of word strings, and some sequence of vectors. You can do:
for i, string in enumerate(word_strings):
    nlp.vocab.set_vector(string, vectors[i])
This might be slow for a large number of vectors, but you should only have to do it this way once. After loading in your vectors, you can save out the nlp object with nlp.to_disk(). Then you can pass that directory to Prodigy.
If you’re using your own pre-trained vectors, take care not to use the md or lg spaCy data packs. Those models use the pre-trained GloVe vectors as features, so if you swap in your own vectors, the activations will be different from what the model expects and you’ll get terrible results. The sm model doesn’t use pre-trained vectors, which makes it easy to swap in your own.
You might also be interested in the terms.train-vectors recipe. This uses Gensim to train on a text corpus, and saves out the model for use with spaCy. It should serve as a working example of how that’s done.
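For reference, an invocation of that recipe might look something like this (the output path and corpus file are placeholders; the `--spacy-model` and `--merge-nps` arguments are the ones discussed elsewhere in this thread, but check `prodigy terms.train-vectors --help` for the exact signature in your version):

```shell
prodigy terms.train-vectors ./my_vectors_model my_corpus.jsonl \
  --spacy-model en_core_web_sm --merge-nps
```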
Awesome, I think this is working! It’s still running through my 1 million word vectors, but it worked without any obvious errors on the first 100, so I’m guessing this will work out. Here’s the complete recipe:
from gensim import models
import numpy as np
import spacy

# Load the trained gensim model and export it in the plain-text word2vec format
word2vec = models.Word2Vec.load('word2vec.model')
word2vec.wv.save_word2vec_format('word2vec.bin')

nlp = spacy.load("en_core_web_sm", vectors=False)
rows, cols = 0, 0
for i, line in enumerate(open('word2vec.bin', 'r')):
    if i == 0:
        # The first line of the word2vec text format is the shape of the table
        rows, cols = line.split()
        rows, cols = int(rows), int(cols)
    else:
        word, *vec = line.split()
        vec = np.array([float(v) for v in vec])
        nlp.vocab.set_vector(word, vec)
@akshitasood63 You can learn vectors with any algorithm. You just need to get the array into numpy, and the list of keys for it.
The only restriction is that the lookup must be ultimately keyed by the lex.orth attribute, so it can’t be context dependent. spaCy has its own way of getting context vectors. You can replace that too, but it’s a bit less convenient (involves subclassing).
You can find the source of this command in the prodigy/recipes/terms.py file. The steps go like this:
Tokenize and pre-process the text using spaCy, with the model provided by the --spacy-model argument. If you don’t set --merge-ents or --merge-nps, it’s okay if the model just uses a tokenizer. If you want to start from an entirely blank model, you could do this:
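For example, a completely blank pipeline (tokenizer only, no components or vectors) can be created and saved out like this; the output directory name is just an example:

```python
import spacy

# Create a blank English pipeline: just a tokenizer, no pipeline components
nlp = spacy.blank('en')

# Save it so the directory can be passed via --spacy-model
nlp.to_disk('./blank_en_model')
```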
The terms.train-vectors recipe takes the data source (ideally, lots of text) and will train vectors on that source, reflecting the use of the words in context. It doesn’t really care what those words are – it will simply assign the meaning representations.
If you’re interested in extracting brand names later on, you probably want to set the --merge-nps flag when you train the vectors. This will merge noun phrases into one token, so you’ll end up with more meaningful vectors for names that consist of more than one token. For example, you’ll want a vector for “Coca Cola”, not two vectors for “Coca” and “Cola”.
Prodigy will look at the model’s vocabulary and try to find other terms that are similar to your seed terms “Coca Cola, Nike, McDonalds”. As you click through the examples and accept or reject them, the target vector is updated, so Prodigy can keep suggesting other terms similar to the seed terms and the ones you’ve accepted (but not like the ones you’ve rejected). If your vectors were trained on enough representative text, you’ll quickly be able to find other brand names, i.e. entries in the vocabulary with representations similar to your target vector.
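Putting that together, the call might look something like this (the dataset name and model path are placeholder examples; the `--seeds` argument takes the comma-separated seed terms):

```shell
prodigy terms.teach brand_terms ./my_vectors_model --seeds "Coca Cola, Nike, McDonalds"
```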
@beckerfuffle @honnibal When I run Michael’s code on my gensim-trained Chinese word vector model, I get the following error:
UnicodeDecodeError                        Traceback (most recent call last)
      2 rows, cols = 0, 0
----> 4 for i, line in enumerate(open('wiki.zh.text.simplified_jieba_seg_cbow_w8_mc3.bin', 'r')):
      5     if i == 0:
      6         rows, cols = line.split()

~/anaconda3/lib/python3.6/codecs.py in decode(self, input, final)
    319         # decode input (taking the buffer into account)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call
    323         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfd in position 18: invalid start byte
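A likely cause (an assumption, since the thread doesn’t confirm it): that .bin file was saved in word2vec’s binary format, and opening it in text mode makes Python try to decode raw bytes as UTF-8. A minimal reproduction with a fake binary file:

```python
# Write a few raw bytes that are not valid UTF-8, mimicking a binary .bin file
with open('fake_vectors.bin', 'wb') as f:
    f.write(b'\xfd\xfe\x00binary vector data')

# Opening it in text mode triggers the same decoding error
try:
    open('fake_vectors.bin', 'r', encoding='utf-8').read()
except UnicodeDecodeError as e:
    print('text mode fails:', e.reason)  # invalid start byte
```

If that’s what happened, re-exporting the gensim model with `save_word2vec_format(..., binary=False)`, or loading it with `KeyedVectors.load_word2vec_format(path, binary=True)` instead of reading lines by hand, should avoid the error.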