Similarity

madhujahagirdar · March 1, 2018, 9:21pm

I am using the following to find similar terms, but it is very slow because it has to go through the whole vocab. Is sense2vec still the preferred way to do similarity? Or, if I want to use gensim, is there a way to convert the spaCy vocab to a gensim vocab?

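# Assumes `nlp` is an already-loaded spaCy pipeline that ships word vectors.
def most_similar(word):
    # Rank only lexemes that have a vector; this is still a full scan of the vocab.
    candidates = [w for w in word.vocab if w.has_vector]
    by_similarity = sorted(candidates, key=word.similarity, reverse=True)
    return [w.orth_ for w in by_similarity[:10]]

print(most_similar(nlp.vocab[u'Malignant']))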

==
Looking at the following git link, it appears that this can be done, but I am not sure how to do it. Any help would be appreciated.

madhujahagirdar · March 3, 2018, 3:40am

@honnibal, @ines any guidance would be useful. Thanks.

honnibal · March 3, 2018, 12:26pm

I’m actually having a bit of trouble figuring this out. In theory it should be simple, but I can’t seem to find exactly what we need. (Incidentally, this is why I hate inheritance… I always feel like it gives me too many places to look.)

On the spaCy side, the three key data members are:

nlp.vocab.vectors.data: A numpy array with the vector data.
nlp.vocab.vectors.key2row: A dict mapping string hashes to rows in the vector table.
nlp.vocab.strings: A spacy.strings.StringStore, mapping hashes to strings.
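Those three members are also enough to do the nearest-neighbour lookup directly in numpy, which avoids calling similarity() once per lexeme. Here is a minimal sketch, not a tested spaCy recipe: the function name most_similar and the toy arrays are made up for illustration, and the plain dict stands in for the StringStore (it only relies on strings[hash] giving the string and strings[string] giving the hash). With a real pipeline you would pass nlp.vocab.vectors.data, nlp.vocab.vectors.key2row and nlp.vocab.strings instead.

```python
import numpy as np


def most_similar(query, data, key2row, strings, n=10):
    """Return the n (word, cosine similarity) pairs closest to `query`."""
    # Several keys may share a row, so rank keys rather than rows.
    keys = list(key2row)
    rows = np.fromiter((key2row[k] for k in keys), dtype=np.int64, count=len(keys))
    vectors = data[rows]

    # Normalise once so cosine similarity becomes a single matrix-vector product.
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing all-zero vectors by zero
    unit = vectors / norms[:, None]

    q = data[key2row[strings[query]]]
    q = q / (np.linalg.norm(q) or 1.0)

    sims = unit @ q
    top = np.argsort(-sims)[:n]
    return [(strings[keys[i]], float(sims[i])) for i in top]


# Toy stand-ins (hypothetical values) for the three spaCy data members.
data = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
key2row = {101: 0, 102: 1, 103: 2}
strings = {101: "malignant", 102: "cancerous", 103: "benign"}
strings.update({word: key for key, word in list(strings.items())})

print(most_similar("malignant", data, key2row, strings, n=2))
```

As in the original snippet, the query word itself comes back as the top hit, so slice it off if you only want the neighbours.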

I think we want to create a gensim.models.keyedvectors.WordEmbeddingsKeyedVectors object. Its superclass BaseKeyedVectors sets self.vectors = [], but I doubt it’s really a list when the class is used. I thought I’d be able to look at the save and load code, but that’s in another superclass (utils.SaveLoad), and I’m having trouble chasing down how that works.

My guess is that we’ll be able to replace that self.vectors with the numpy array, and then load the keys into the self.vocab attribute.
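A different route that sidesteps the class internals entirely is to dump the table in the word2vec text format and let Gensim read it back. This is a rough sketch of that alternative, not the attribute-swapping approach guessed at above: write_word2vec_text is a hypothetical helper, the toy values are invented, and the dict again stands in for the StringStore. Keys whose strings contain whitespace are skipped because the text format is whitespace-delimited.

```python
import io


def write_word2vec_text(out, data, key2row, strings):
    """Write vectors to `out` in word2vec text format; return the number written."""
    entries = []
    for key, row in key2row.items():
        word = strings[key]
        # The format separates fields with spaces, so such keys cannot be stored.
        if not word or any(ch.isspace() for ch in word):
            continue
        entries.append((word, data[row]))

    dim = len(data[0]) if len(data) else 0
    out.write(f"{len(entries)} {dim}\n")  # header line: vocabulary size and dimension
    for word, vec in entries:
        out.write(word + " " + " ".join(f"{float(x):.6f}" for x in vec) + "\n")
    return len(entries)


# Toy stand-ins (hypothetical values); with spaCy these would be
# nlp.vocab.vectors.data, nlp.vocab.vectors.key2row and nlp.vocab.strings.
toy_data = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_key2row = {101: 0, 102: 1, 103: 2}
toy_strings = {101: "malignant", 102: "benign", 103: "New York"}

buf = io.StringIO()
written = write_word2vec_text(buf, toy_data, toy_key2row, toy_strings)
print(written)
print(buf.getvalue())
```

If the same text is written to a file on disk, gensim.models.KeyedVectors.load_word2vec_format(path, binary=False) should be able to load it, after which Gensim's own most_similar is available. I haven't run that end to end against a spaCy model, so treat it as a starting point.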

I can’t find an API method that does what we want, but maybe I’m not looking in the right place. The API reference is organized by class, so to get the complete docs for a given class, you have to visit the docs for the superclasses, and remember which methods are overridden. I know RaRe are working on a new docs system to address this, which I’m sure will be done soon — they’ve been pushing lots of great updates to Gensim lately.
