Word vectors: How do they work?

Hi folks,
I’m trying to wrap my head around how the word vectors are used by prodigy and if it’s worth it (or even advisable) to use custom vectors?

I’ve created my own model based on en_core_web_sm, but with the word vectors replaced with my own gensim vectors (see here). This seems to work fine for the terms.teach task, but based on what I’ve been reading on these boards, I’m becoming doubtful that it’s useful for other tasks (e.g. textcat.teach). Am I correct that replacing the word vectors alone isn’t sufficient, since my gensim model almost certainly preprocessed the data differently? I tried my model out on textcat.teach and got really strange results even after several hundred annotations: most scores were very close to either 0.0 or 1.0, and there didn’t seem to be any logical reason for this.

Secondly, I’ve been following several recent threads about terms.train-vectors being broken in various ways. Is it even possible to train custom vectors with Prodigy at the moment?

The text I’m working with is very domain specific with lots of medical and scientific jargon so I’m concerned that using one of the pre-built spacy models would be a bad fit but I’m not convinced I really am better off building my own at this point.

I’m looking for advice here… should I build my own models with my own domain-specific word vectors? If so, what’s the least error-prone method for doing so? If I should use a pre-built model instead, which of the four different en models should I use?

Apologies for any instabilities around this. The problem is that the train-vectors command is an odd fit with the rest of the Prodigy CLI: it’s a long-running batch job with several stages and a lot of options, while the rest of the Prodigy commands are about interactivity or starting a service.

I think you can take the train-vectors command as a starting point, that you can modify for your requirements. It shows how to pre-process the text with spaCy, and how to save the model for use in spaCy. The other really useful way to get word vectors into a spaCy model is with the spacy init-model command. In recent versions of spaCy, this takes a .tgz or .zip file in word2vec’s text format, which most tools can produce. This is the native output of FastText, for instance.
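For reference, the word2vec text format mentioned above is very simple: a header line with the vocab size and vector dimensionality, followed by one line per word. Here’s a minimal pure-Python sketch that writes a file in that format (the words and vectors are made up for illustration):

```python
# Write toy vectors in word2vec's plain-text format, which
# `spacy init-model` and most other tools can read.
vectors = {
    "aspirin":   [0.12, -0.30, 0.45],
    "ibuprofen": [0.10, -0.28, 0.50],
    "banquet":   [-0.60, 0.22, 0.05],
}

def write_word2vec_text(vectors, path):
    dims = len(next(iter(vectors.values())))
    with open(path, "w", encoding="utf8") as f:
        # Header line: "<vocab size> <vector dimensionality>"
        f.write(f"{len(vectors)} {dims}\n")
        for word, vec in vectors.items():
            f.write(word + " " + " ".join(f"{v:.6f}" for v in vec) + "\n")

write_word2vec_text(vectors, "vectors.txt")
```

You’d then gzip the file and point init-model at it, along the lines of `python -m spacy init-model en ./my_model --vectors-loc vectors.txt.gz` (check the init-model docs for the exact options in your spaCy version). If you’re already in gensim, `KeyedVectors.save_word2vec_format(path, binary=False)` produces this same format directly.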

This might matter less than you think — although maybe on medical text it’s more significant. You should probably compare the tokenizer outputs and have a look.
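One quick way to compare is to run both preprocessing pipelines over the same sample texts and count how often they disagree. This sketch uses two simple stand-in tokenizers just to show the shape of the comparison — you’d swap in spaCy’s tokenizer and whatever preprocessing you fed gensim:

```python
import re

def tokenize_a(text):
    # Splits off punctuation as separate tokens (roughly spaCy-like).
    return re.findall(r"\w+|[^\w\s]", text)

def tokenize_b(text):
    # Naive lowercased whitespace split (roughly what simple gensim
    # preprocessing might produce).
    return text.lower().split()

samples = [
    "Patients received 50mg/day of the drug.",
    "No adverse events were reported.",
]

mismatches = sum(tokenize_a(s) != tokenize_b(s) for s in samples)
print(f"{mismatches}/{len(samples)} samples tokenized differently")
```

If the mismatch rate on your real text is high — and with medical text full of doses, units and abbreviations, it may well be — many of your vector keys won’t line up with the tokens spaCy produces.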

What did your similarities look like, after you trained your model? Like, was your word2vec model any good in general?

You might try using Prodigy to do some evaluation, if you don’t have another way of evaluating your vectors. A lot of folks do ratings between 1 and 5, and then check how well the human correlates with the model. I dislike this because I think keeping a consistent scale is really weird and hard. How related should “apple” and “banquet” be? Like, they’re both nouns about food…But they’re not synonyms or co-hyponyms.
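If you do go the rating route, the check itself is just a rank correlation between the human scores and the model’s similarities. A toy sketch with made-up numbers (Spearman’s rho, no handling of tied ranks):

```python
human = [5, 4, 1, 2, 3]             # annotator ratings for 5 word pairs
model = [0.9, 0.4, 0.1, 0.3, 0.7]   # model similarities for the same pairs

def ranks(xs):
    # Rank of each value within the list (0 = smallest); assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(f"Spearman rho: {spearman(human, model):.2f}")
```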

Here’s another idea you might try instead: make tasks that ask, “Which of these two words is most similar to this word?”. Instead of questions like, “Rate the similarity of ‘banquet’ and ‘apple’ 1-5”, you’d be asked “Which of these is more similar to ‘banquet’: ‘apple’, or ‘fiduciary’?”

First, randomly draw a word from the vocab, and then two other words. You probably want one word that the model thinks is pretty similar, and another that the model thinks is pretty dissimilar. Maybe thresholds of >0.7 and <0.3 would be good. Randomly allocate the two candidates to “accept” and “reject”, like we do in ner.eval-ab. If you think the green word is more similar to the target, you click “Accept”; if the red word is more similar, you click “Reject”. At the end of the session, you find out how often the model agreed with you.
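The steps above can be sketched like this, with toy vectors and plain cosine similarity standing in for your real model (the words and thresholds are just illustrations):

```python
import random

# Made-up toy vectors; in practice you'd use your trained vectors.
vectors = {
    "banquet":   [0.9, 0.1, 0.0],
    "feast":     [0.85, 0.15, 0.05],
    "apple":     [0.6, 0.5, 0.1],
    "fiduciary": [0.0, 0.1, 0.95],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def make_task(target, rng=random):
    sims = {w: cosine(vectors[target], v)
            for w, v in vectors.items() if w != target}
    similar = [w for w, s in sims.items() if s > 0.7]
    dissimilar = [w for w, s in sims.items() if s < 0.3]
    if not similar or not dissimilar:
        return None  # no candidates past the thresholds for this target
    pair = [rng.choice(similar), rng.choice(dissimilar)]
    rng.shuffle(pair)  # randomly allocate to the accept/reject slots
    # "accept" = green word, "reject" = red word, as in ner.eval-ab
    return {"target": target, "accept": pair[0], "reject": pair[1]}

task = make_task("banquet", random.Random(0))
print(task)
```

Your agreement score at the end of the session is then just the fraction of tasks where your answer matched the higher-similarity candidate.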
