Trying to re-train sense2vec

I'm trying to re-train sense2vec on my own data, but how does sense2vec actually work?

With my new vectors I get "Can't find seed term 'adjusted EBITDA' in vectors" when I try to use sense2vec.teach with that term as the seed, even though I'm sure the term appears in the text.

Even with the Reddit 2015 data it does not find similar words:

from sense2vec import Sense2Vec

reddit = Sense2Vec().from_disk("/Users/ysz/data/s2v_reddit_2015_md")
most_similar(reddit, "cottage cheese|NOUN")
# AssertionError

where

def most_similar(s2v, query):
    # the query must be an exact key in the vector table
    assert query in s2v
    vector = s2v[query]
    freq = s2v.get_freq(query)
    results = s2v.most_similar(query, n=3)
    return results

You're typically querying a phrase in sense2vec, so you probably want to query:

most_similar(reddit, "cottage_cheese|NOUN")
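
Since keys follow the pattern words_joined_by_underscores|SENSE, it can help to normalise a plain phrase into a key before looking it up. Here's a minimal sketch (to_key is a hypothetical helper, not part of the sense2vec API, and the NOUN default is just an assumption about the common case):

def to_key(phrase, sense="NOUN"):
    # sense2vec keys join multi-word phrases with "_"
    # and append the sense tag after a "|"
    return phrase.strip().replace(" ", "_") + "|" + sense

to_key("cottage cheese")  # 'cottage_cheese|NOUN'

The same applies to the teach seed term: assuming the phrase was tagged as a noun during preprocessing, the key would be "adjusted_EBITDA|NOUN" rather than "adjusted EBITDA".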

There are a few "tricks" in sense2vec, but if you have a look at the training scripts you'll notice how the noun chunks are clumped together.

The trick is to merge each noun chunk into a single token; from there on it's treated as one token, so that phrases like "star wars" become star_wars. That's also why the README lists this query:

query = "natural_language_processing|NOUN"

You really need the underscore to link the words together into a single phrase. Because phrases are turned into single tokens tagged with a sense (via |NOUN), we're also able to train embeddings for them.
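
As a rough illustration of that preprocessing, here's a minimal sketch using spaCy's retokenizer, assuming en_core_web_sm is installed. The actual training scripts in the sense2vec repo do more (filtering, sense tagging, and so on), but the core merging step looks something like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love star wars and natural language processing.")

# merge each noun chunk into a single token
with doc.retokenize() as retokenizer:
    for chunk in list(doc.noun_chunks):
        retokenizer.merge(chunk)

# build sense2vec-style keys: underscores inside the phrase,
# sense tag after the pipe
keys = [token.text.replace(" ", "_") + "|" + token.pos_ for token in doc]
print(keys)  # includes 'star_wars|NOUN' and 'natural_language_processing|NOUN'

Once every phrase is a single token like that, any word2vec-style trainer can learn embeddings for the keys. The real training scripts use their own pipeline, but just to illustrate the idea with gensim (the tiny corpus below is made up):

from gensim.models import Word2Vec

# each "sentence" is a list of sense2vec-style keys
corpus = [
    ["i|PRON", "love|VERB", "star_wars|NOUN"],
    ["natural_language_processing|NOUN", "is|AUX", "fun|ADJ"],
]
model = Word2Vec(corpus, vector_size=300, min_count=1)
model.wv.most_similar("star_wars|NOUN")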

Details

I've trained my own sense2vec model in the past, and I came to the realisation that you need a lot of data before sensible phrases start to pop up. To really get a benefit out of training your own, you need a large corpus with domain-specific phrases that are unlike Reddit.

Feel free to ask more questions about this if you find it interesting.