Trying to re-train sense2vec

I'm trying to re-train sense2vec on my own data, but how does sense2vec actually work?

With my new vectors I get "Can't find seed term 'adjusted EBITDA' in vectors" when I try to use sense2vec.teach with that term as the seed, even though I'm sure the term appears in the text.

Even with the Reddit 2015 data it does not find similar words:

from sense2vec import Sense2Vec

reddit = Sense2Vec().from_disk("/Users/ysz/data/s2v_reddit_2015_md")
most_similar(reddit, "cottage cheese|NOUN")
# AssertionError

where

def most_similar(s2v, query):
    # the query must be an exact key in the vector table
    assert query in s2v
    vector = s2v[query]
    freq = s2v.get_freq(query)
    results = s2v.most_similar(query, n=3)
    return results

You're typically querying a phrase in sense2vec, so you probably want to query:

most_similar(reddit, "cottage_cheese|NOUN")
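
Since keys follow the pattern words_joined_by_underscores|SENSE, it can help to normalise a plain phrase into a key before looking it up. Here's a minimal sketch (to_key is a hypothetical helper, not part of the sense2vec API, and the NOUN default is just an assumption about the common case):

def to_key(phrase, sense="NOUN"):
    # sense2vec keys join multi-word phrases with "_"
    # and append the sense tag after a "|"
    return phrase.strip().replace(" ", "_") + "|" + sense

to_key("cottage cheese")  # 'cottage_cheese|NOUN'

The same applies to the teach seed term: assuming the phrase was tagged as a noun during preprocessing, the key would be "adjusted_EBITDA|NOUN" rather than "adjusted EBITDA".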

There are a few "tricks" in sense2vec, but if you have a look at the training scripts you'll notice how the noun chunks are clumped together.

The trick is to merge each noun chunk into a single token; from there on it's treated as one token, so that phrases like "star wars" become star_wars. That's also why the README lists this query:

query = "natural_language_processing|NOUN"

You really need the underscore to link the words together into a single phrase. Because phrases are turned into single tokens tagged with a sense (via |NOUN), we're also able to train embeddings for them.
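
As a rough illustration of that preprocessing, here's a minimal sketch using spaCy's retokenizer, assuming en_core_web_sm is installed. The actual training scripts in the sense2vec repo do more (filtering, sense tagging, and so on), but the core merging step looks something like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love star wars and natural language processing.")

# merge each noun chunk into a single token
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        retokenizer.merge(chunk)

# build sense2vec-style keys: underscores inside the phrase,
# sense tag after the pipe
keys = [token.text.replace(" ", "_") + "|" + token.pos_ for token in doc]
print(keys)  # includes 'star_wars|NOUN' and 'natural_language_processing|NOUN'

Once every phrase is a single token like that, any word2vec-style trainer can learn embeddings for the keys. The real training scripts use their own pipeline, but just to illustrate the idea with gensim (the tiny corpus below is made up):

from gensim.models import Word2Vec

# each "sentence" is a list of sense2vec-style keys
corpus = [
    ["i|PRON", "love|VERB", "star_wars|NOUN"],
    ["natural_language_processing|NOUN", "is|AUX", "fun|ADJ"],
]
model = Word2Vec(corpus, vector_size=300, min_count=1)
model.wv.most_similar("star_wars|NOUN")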

Details

I've trained my own sense2vec model in the past, and I came to the realisation that you need a lot of data before sensible phrases start to pop up. To really get a benefit out of training your own, you need a large corpus with domain-specific phrases that are unlike Reddit.

Feel free to ask more questions about this if you find it interesting.