sense2vec.teach vectors

I'm trying to use sense2vec.teach for the first time. I'm using prodigy nightly 1.11.0a11. Per the instructions in https://prodi.gy/docs/recipes#terms-teach I used en_core_web_lg as the vectors file:

python -m prodigy sense2vec.teach my_data_set en_core_web_lg --seeds prodigy_data\seeds.txt

However, I get the following:

raise ValueError(f"Can't read file: {location}")
ValueError: Can't read file: en_core_web_lg\cfg

Also, if I try to download en_vectors_web_lg, I get an HTTP 404 error, presumably because the spaCy version is 3.1.1.

Hi! The sense2vec.teach recipe takes the path to trained sense2vec vectors, not a regular spaCy pipeline. So you'll want to download one of the pretrained vector packages from the explosion/sense2vec repository on GitHub and use the path to that instead. There's also an example command lower down in the docs.
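As a quick sanity check before running the recipe, the traceback above suggests the loader looks for a file named cfg inside the path it is given. The helper below is a minimal sketch built only on that observation: the function name is made up for illustration, and it does not check for any other files a vectors directory may contain.

```python
from pathlib import Path


def looks_like_vectors_dir(path):
    """Return True if `path` is a directory that contains a file named `cfg`.

    Assumption: the only requirement checked here is the `cfg` file named in
    the error message. A real vectors directory may need more than this.
    """
    directory = Path(path)
    return directory.is_dir() and (directory / "cfg").is_file()
```

Passing a bare package name such as en_core_web_lg would normally return False here, because it is not a directory on disk relative to where the command is run.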

OK, here's what I want to do, and maybe sense2vec isn't the solution. I would like to create patterns with terms.teach and terms.to-patterns, except that, according to the documentation, terms.teach doesn't appear to handle terms with spaces ("multi-word terms"?). I haven't tried it yet. So I was planning to use sense2vec.teach and terms.to-patterns instead. What would you suggest I do to extract patterns from datasets that contain my annotations? I have a script that extracts the text in spans per label.

I have a very specific (large) set of named entities that I would like to have recognized, and they may not appear in Reddit discussions. As I understand it, the purpose of patterns is to avoid needing billions of examples to train against.

Yes, the problem with word2vec is that it's... well, word2vec 😅 So you'll only be able to get vectors and compare similarities for individual words, not phrases. sense2vec solves this by training a model on preprocessed text that merges noun phrases and entities and includes labels or part-of-speech tags. This lets you write more specific queries and check similarities for multi-word phrases.
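To make the "merged phrase plus label" idea concrete: sense2vec keys look like natural_language_processing|NOUN, with whitespace replaced by underscores and the tag appended after a pipe. The two functions below are a rough sketch of that key format in plain Python. They are illustrative stand-ins written for this post, not the library's own helpers.

```python
import re


def make_key(phrase, sense):
    """Build a key like 'natural_language_processing|NOUN' from a phrase and tag."""
    # Collapse each run of whitespace into a single underscore.
    return re.sub(r"\s+", "_", phrase.strip()) + "|" + sense


def split_key(key):
    """Split a key back into (phrase, sense), restoring spaces in the phrase."""
    # Split on the last pipe so the tag is always the final segment.
    phrase, sense = key.rsplit("|", 1)
    return phrase.replace("_", " "), sense
```

Because the whole phrase is a single key, a multi-word term gets its own vector instead of being broken into individual words.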

So it does seem like sense2vec would be a good fit for what you're doing. You just need to pick one of the available pretrained vector files and then you can load them in and find other similar terms given a list of seed terms.
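Once you have a list of accepted terms per label (for example from the script you mentioned), turning multi-word terms into token-based match patterns could look roughly like the sketch below. It assumes the usual JSONL pattern format with a "label" and a list of {"lower": ...} token dicts, and it splits on whitespace as a simplification. The function names and the sample term are made up for illustration, and a real tokenizer will split punctuation and hyphens differently.

```python
import json


def terms_to_patterns(terms_by_label):
    """Convert {label: [term, ...]} into a list of token-based pattern dicts."""
    patterns = []
    for label, terms in terms_by_label.items():
        for term in terms:
            # Assumption: whitespace splitting approximates tokenization.
            tokens = term.split()
            if not tokens:
                continue  # skip empty or whitespace-only terms
            patterns.append(
                {"label": label, "pattern": [{"lower": t.lower()} for t in tokens]}
            )
    return patterns


def to_jsonl(patterns):
    """Serialize pattern dicts as one JSON object per line."""
    return "\n".join(json.dumps(p) for p in patterns)


# Made-up example input: one label with one two-word term.
example = to_jsonl(terms_to_patterns({"ORG": ["Acme Corp"]}))
```

Each token in a term becomes one entry in the pattern, so a two-word term matches two consecutive tokens regardless of case.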