terms.teach not working for nightly

Hi,
I have spaCy 3.0.5 and Prodigy nightly 1.11.0a. I was following "Training a NAMED ENTITY RECOGNITION MODEL with Prodigy and Transfer Learning", and the command for the first step of teaching works:

prodigy sense2vec.teach food_terms ./s2v_reddit_2015_md --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"

But when I run

prodigy terms.teach dl_test en_core_web_trf --seeds insults.txt

I am getting the error

ValueError: [E010] Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors

I also tried "en_core_web_sm" and I get the same error. Could you help me with that?

Also, I am trying to detect "deep learning" related terms in resumes and white papers. Could you guide me on which dictionary/model to start with for the teaching?

For example, in the "food ingredients" project you used "s2v_reddit_2015_md". Does "s2v_reddit_2015_md" contain mostly food-related discussion?

I tried to run

prodigy sense2vec.teach dl_test ./s2v_reddit_2015_md --seeds "neural network,keras,theano,face detection,convolutional neural network,recurrent neural network,object detection,yolo,gpu,cuda,tensorflow,opencv,computer vision,Region Based Convolutional Neural Networks (R-CNN),single shot detection (ssd),overfeat network,mask R-CNN,mobilenet"

but I got the error ✘ Can't find seed term in vectors

Thanks,
Debojit

Hi!

This is probably a bit unexpected, but the en_core_web_trf model doesn't have "traditional" word vectors installed, because it uses a transformer to do the word embeddings. Likewise, en_core_web_sm is a small model that we provide without word vectors on purpose. Could you try the command again with either en_core_web_md or en_core_web_lg?
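If you want to check up front whether a given pipeline ships word vectors, here is a minimal sketch. It assumes spaCy 3.x, where `nlp.vocab.vectors_length` is 0 when no vector table is loaded. The helper names `has_static_vectors` and `report` are just illustrative and not part of spaCy or Prodigy.

```python
def has_static_vectors(nlp) -> bool:
    """Return True if the pipeline's vocab has a non-empty vector table."""
    # Assumption: Vocab.vectors_length is 0 for packages without word vectors,
    # which matches the "Word vectors set to length 0" wording of E010.
    return nlp.vocab.vectors_length > 0


def report(model_names):
    """Load each installed pipeline and print whether it has word vectors."""
    import spacy  # imported lazily so the helper above has no hard dependency

    for name in model_names:
        nlp = spacy.load(name)  # raises OSError if the package isn't installed
        print(name, nlp.vocab.vectors.shape, has_static_vectors(nlp))
```

Calling `report(["en_core_web_sm", "en_core_web_md"])` with both packages installed should show which of them is usable as the vectors model for terms.teach.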

Also, I am trying to detect "deep learning" related terms in resumes and white papers. Could you guide me on which dictionary/model to start with for the teaching?
For example, in the "food ingredients" project you used "s2v_reddit_2015_md". Does "s2v_reddit_2015_md" contain mostly food-related discussion?

s2v_reddit_2015_md is a generic model containing vectors pretrained on Reddit comments from 2015. As far as I know, these aren't specific for food-related words, and in fact should be pretty usable for technology terms too (but as always, the proof is in the pudding! :wink: )

I got the error ✘ Can't find seed term in vectors

Is this the verbatim error message? I'd expect this error message to specify exactly which term wasn't found. Could you paste the whole error message & stack trace, if available?

This is probably a bit unexpected, but the en_core_web_trf model doesn't have "traditional" word vectors installed, because it uses a transformer to do the word embeddings. Likewise, en_core_web_sm is a small model that we provide without word vectors on purpose. Could you try the command again with either en_core_web_md or en_core_web_lg?

I am still getting the error with en_core_web_sm, but it works with en_core_web_lg. However, I hardly get any matches with en_core_web_lg.

Is this the verbatim error message? I'd expect this error message to specify exactly which term wasn't found. Could you paste the whole error message & stack trace, if available?

Yes, you are correct: the message named the exact term that was not found. The problem is that I started with the command below

prodigy sense2vec.teach dl_test ./s2v_reddit_2015_md --seeds "neural network,keras,theano,face detection,convolutional neural network,recurrent neural network,object detection,yolo,gpu,cuda,tensorflow,opencv,computer vision,Region Based Convolutional Neural Networks (R-CNN),single shot detection (ssd),overfeat network,mask R-CNN,mobilenet"

and, after deleting the missing terms one at a time, I finally ended up with

prodigy sense2vec.teach dl_test ./s2v_reddit_2015_md --seeds "neural network,theano,face detection,gpu,cuda,opencv,computer vision"

So, I had to remove most of the terms. That is why I was asking for suggestions on which model/dictionary will be best for technical training.

Thanks,
Debojit

I just had a look at the seed terms and it's not so surprising that many of them weren't found in the 2015 vectors:

For example, TensorFlow and Keras were only released in 2015 :sweat_smile: So there would have been very few mentions of those terms on Reddit up to 2015 and the terms probably wouldn't have made the frequency cut. Similarly, the string "Region Based Convolutional Neural Networks (R-CNN)" is very long and specific and it's not a proper noun or entity that would have been merged during the sense2vec training. So there's kinda no way that it would have a vector. The same goes for a lot of the other terms.
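To avoid removing seeds one at a time, you could check them all against the vectors before starting the recipe. This is only a rough sketch: it assumes the `sense2vec` package is installed and uses its `Sense2Vec.from_disk` and `get_best_sense` methods. The helper names are made up for illustration, and the lookup may not mirror exactly how the sense2vec.teach recipe resolves seeds.

```python
def split_seeds(seed_string, find_key):
    """Split a comma-separated seeds string into found and missing terms.

    find_key is any callable that returns a vector key for a term,
    or None when the term has no vector.
    """
    found, missing = [], []
    for term in seed_string.split(","):
        term = term.strip()
        if not term:
            continue  # skip empty entries such as a trailing comma
        if find_key(term) is not None:
            found.append(term)
        else:
            missing.append(term)
    return found, missing


def check_against_sense2vec(seed_string, vectors_path):
    """Check seeds against sense2vec vectors stored at vectors_path."""
    from sense2vec import Sense2Vec  # lazy import: only needed for real lookups

    s2v = Sense2Vec().from_disk(vectors_path)
    # get_best_sense returns the best-matching key, or None if there is none
    return split_seeds(seed_string, s2v.get_best_sense)
```

For example, `found, missing = check_against_sense2vec("neural network,keras,gpu", "./s2v_reddit_2015_md")` would give you the terms that are safe to pass to `--seeds` and the ones to drop or rephrase.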
