I am using the following method to find similar terms, but it is very slow because it has to go through the whole vocab. Is sense2vec still the preferred way to do similarity? Or, if I want to use gensim, is there a way to convert the spaCy vocab to a gensim vocab?

def most_similar(word):
    # word is a spaCy Lexeme, e.g. nlp.vocab["apple"]; this scores every entry in the vocab
    by_similarity = sorted(word.vocab, key=lambda w: word.similarity(w), reverse=True)
    return [w.orth_ for w in by_similarity[:10]]


Looking at the following git link, it appears that we can do it, but I am not sure how. Any help would be appreciated.

@honnibal, @ines any guidance would be useful. Thanks.

I’m actually having a bit of trouble figuring this out. In theory it should be simple, but I can’t seem to find exactly what we need. (Incidentally, this is why I hate inheritance: I always feel like it gives me too many places to look.)

On the spaCy side, the three key data members are:

  • nlp.vocab.vectors.data: A numpy array with the vector data.

  • nlp.vocab.vectors.key2row: A dict mapping string hashes to rows in the vector table.

  • nlp.vocab.strings: A spacy.strings.StringStore, mapping hashes to strings.
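To make the relationship between these three pieces concrete, here is a rough sketch of a similarity lookup that works on them directly. It computes all cosine scores with one matrix product instead of calling similarity() once per vocab entry. The function name and its parameters are made up for illustration, and it assumes the three objects behave like a 2-D float array, a dict, and a hash-to-string mapping.

```python
import numpy as np


def most_similar_from_table(data, key2row, strings, query_key, n=10):
    """Return the n strings whose vectors are closest to query_key's vector.

    data      -- 2-D array of vectors (assumed: nlp.vocab.vectors.data)
    key2row   -- dict mapping string hashes to row indices
    strings   -- mapping from string hashes back to strings
    query_key -- hash of the query word; it must be present in key2row
    """
    # Normalise each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(data, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros
    unit = data / norms

    # One matrix-vector product scores every row against the query row.
    scores = unit @ unit[key2row[query_key]]

    # Several keys may share a row, so rank the keys rather than the rows.
    keys = list(key2row)
    rows = np.fromiter((key2row[k] for k in keys), dtype=np.int64, count=len(keys))
    order = np.argsort(-scores[rows], kind="stable")[:n]
    return [strings[keys[i]] for i in order]
```

With a loaded pipeline, the call would presumably look like most_similar_from_table(nlp.vocab.vectors.data, nlp.vocab.vectors.key2row, nlp.vocab.strings, nlp.vocab.strings["apple"]). Like the original snippet, the query word itself comes back as the top hit.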

I think we want to create a gensim.models.keyedvectors.WordEmbeddingsKeyedVectors object. Its superclass BaseKeyedVectors sets self.vectors = [], but I doubt it’s really a list when the class is used. I thought I’d be able to look at the save and load code, but that’s in another superclass (utils.SaveLoad), and I’m having trouble chasing down how that works.

My guess is that we’ll be able to replace that self.vectors with the numpy array, and then load the keys into the self.vocab attribute.

I can’t find an API method that does what we want, but maybe I’m not looking in the right place. The API reference is organized by class, so to get the complete docs for a given class, you have to visit the docs for the superclasses, and remember which methods are overridden. I know RaRe are working on a new docs system to address this, which I’m sure will be done soon — they’ve been pushing lots of great updates to Gensim lately.
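One route that avoids the class internals entirely, offered as a sketch rather than a recommended method, is to dump the table in the plain-text word2vec format and let gensim read that file. The helper name below is hypothetical, and it takes the same three objects as plain arguments. Keys whose strings contain whitespace are skipped, because the text format separates fields with spaces.

```python
import numpy as np


def write_word2vec_text(path, data, key2row, strings):
    """Write vectors in word2vec text format; return the number of words written.

    data    -- 2-D array of vectors (assumed: nlp.vocab.vectors.data)
    key2row -- dict mapping string hashes to row indices
    strings -- mapping from string hashes back to strings
    """
    entries = []
    for key, row in key2row.items():
        word = strings[key]
        # Skip empty strings, strings with whitespace, and unassigned rows.
        if not word or any(ch.isspace() for ch in word) or row < 0:
            continue
        entries.append((word, data[row]))

    with open(path, "w", encoding="utf8") as f:
        # Header line: number of words, then vector width.
        f.write(f"{len(entries)} {data.shape[1]}\n")
        for word, vec in entries:
            values = " ".join(repr(float(x)) for x in vec)
            f.write(f"{word} {values}\n")
    return len(entries)
```

Assuming gensim is installed, gensim.models.KeyedVectors.load_word2vec_format(path, binary=False) should be able to read the resulting file, after which its most_similar method is available. The cost is a round trip through disk, which may be slow for large tables.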