terms.train-vectors: CPU cores and optimal number of workers

My goal: I would like to compute word embeddings quickly, so a single CPU core is definitely not sufficient.

I would like to run calculations on multiple CPU cores.

Question: How do I run the calculations on multiple CPU cores? terms.train-vectors seems to have only one parameter related to this: the --n-workers parameter.

The above screenshot was taken on an AWS c5.12xlarge instance (48 vCPUs, 96 GB RAM) with --n-workers set to 4, yet only one vCPU was being used.

The terms.train-vectors recipe uses Gensim to do the word vector training. Some of Gensim's preprocessing is done in pure Python, which makes it rather slow. Once the tokenization and vocabulary counting are done, it should use multiple cores for the actual training.

If Gensim is too slow, you might want to look into using the FastText or GloVe libraries instead. This guide might help you, especially if you're looking at using the --merge-phrases and --merge-entities arguments: https://github.com/explosion/sense2vec#-training-your-own-sense2vec-vectors .

Honestly though, it probably won't take that long on most corpora. You can just rent a smaller machine and leave it running for a day. You don't have to run the word vector training very often, so it's not so bad if it's a bit slow.

You can also train word vectors directly with whatever tool you like, and then convert the vectors into a spaCy model. The spacy init-model command has a --vectors-loc argument that takes a word vectors file produced by most of the standard tools.
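As a sketch, the conversion could look like this with spaCy v2 (the language code and file paths here are placeholders for your own data):

```shell
# Train vectors with any external tool, producing a plain-text vectors file
# (one token per line, followed by its vector components), then build a
# loadable spaCy model directory from it:
python -m spacy init-model en ./my_model --vectors-loc ./vectors.txt
```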