Prodigy sense2vec.teach recipe with a gensim word2vec model


Based on the sense2vec.teach recipe:

prodigy sense2vec.teach food_terms data/s2v_reddit_2015_md/ --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"

I would like to use a gensim word2vec model to generate words/phrases similar to the given seed words/phrases.

For this to work smoothly, does the gensim word2vec model need to have the same format as the "s2v_reddit_2015_md/" model? If so, how can I convert the gensim word2vec model to be compatible with sense2vec? Or is there another way to achieve this goal?

Hi! A sense2vec model is essentially just a word2vec model trained on words/phrases with concatenated POS tags or entity labels. But in order to query that, the sense2vec library includes various methods so you can look up words with tags etc.
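To make the key format concrete: sense2vec stores each entry under a key that combines the phrase (with spaces replaced by underscores) and its sense, i.e. a POS tag or entity label, separated by a pipe. Here's a minimal sketch of that keying scheme in plain Python (the helper names are illustrative, not the library's API):

```python
def make_key(word: str, sense: str) -> str:
    """Build a sense2vec-style key: spaces become underscores and the
    sense (POS tag or entity label) is appended after a pipe."""
    return word.replace(" ", "_") + "|" + sense

def split_key(key: str) -> tuple:
    """Split a sense2vec-style key back into (word, sense)."""
    text, _, sense = key.rpartition("|")
    return text.replace("_", " "), sense

print(make_key("cottage cheese", "NOUN"))  # cottage_cheese|NOUN
print(split_key("olive_oil|NOUN"))         # ('olive oil', 'NOUN')
```

A plain word2vec model has no sense part in its keys, which is why tag-based lookups have nothing to work with.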

If you just have a regular w2v model, using it via the sense2vec library doesn't make that much sense, and a lot of the assumptions in the sense2vec.teach workflow don't really hold up either, because your vectors can't be queried by tag. I think a better solution would be one of the following:

  • Add your word vectors to a blank spaCy pipeline and use it with terms.teach.
  • Write your own recipe script that loads your vectors, calculates the average of the vectors of all seed terms and then finds the most similar entries in your vector table and sends them out. See here for the most_similar implementation in spaCy.
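The core of the second option can be sketched in a few lines of numpy: average the seed vectors, then rank every other entry by cosine similarity to that average (the same idea as spaCy's most_similar over a vector table). The function names and the toy vector table are illustrative; a real Prodigy recipe would load a gensim KeyedVectors object instead and wrap this logic in a @prodigy.recipe that streams out suggestions:

```python
import numpy as np

def most_similar(vectors: dict, seeds: list, n: int = 10) -> list:
    """Average the seed vectors and rank all non-seed entries by
    cosine similarity to that average."""
    keys = [k for k in vectors if k not in seeds]
    # Unit-normalize the seed vectors, then average them into one query
    seed_vecs = np.vstack([vectors[s] for s in seeds])
    seed_vecs = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    query = seed_vecs.mean(axis=0)
    query = query / np.linalg.norm(query)
    # Cosine similarity between the query and every candidate entry
    matrix = np.vstack([vectors[k] for k in keys])
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = matrix @ query
    order = np.argsort(-sims)[:n]
    return [(keys[i], float(sims[i])) for i in order]

# Toy vector table standing in for a word2vec lookup
vectors = {
    "garlic": np.array([1.0, 0.1, 0.0]),
    "onion":  np.array([0.9, 0.2, 0.0]),
    "car":    np.array([0.0, 0.1, 1.0]),
}
print(most_similar(vectors, ["garlic"], n=2))
```

With gensim, you can get much the same result directly from model.wv.most_similar(positive=seed_terms), which also averages the positive vectors before ranking.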

Thanks a lot!

This is what I was looking for. I'm working with complicated scientific text, and sense2vec gave poor results; the POS tags don't work well on it.

Glad to hear! And yeah, the sense2vec vectors we trained were trained on Reddit text, which is pretty far from scientific texts :sweat_smile: