terms.train-vectors: CPU cores and optimal number of workers

My goal: I would like to compute word embeddings quickly, so a single CPU core is definitely not sufficient.

I would like to run calculations on multiple CPU cores.

Question: How do I run the calculations on multiple CPU cores? terms.train-vectors seems to have only one parameter related to this: the --n-workers parameter.

The above screenshot was taken on an AWS c5.12xlarge instance (48 vCPUs, 96 GB RAM) with --n-workers set to 4, yet only one vCPU was being used.

The terms.train-vectors recipe uses Gensim to do the word vector training. Some of Gensim's preprocessing is done in pure Python, which makes it rather slow. Once the tokenization and vocabulary counting are done, it should use multiple cores for the actual training.

If Gensim is too slow, you might want to look into using the FastText or GloVe libraries instead. This guide might help you, especially if you're looking at using the --merge-phrases and --merge-entities arguments: https://github.com/explosion/sense2vec#-training-your-own-sense2vec-vectors .

Honestly though, it probably won't take that long on most corpora. You can just rent a smaller machine and leave it running for a day. You don't have to run the word vector training very often, so it's not so bad if it's a bit slow.

You can also train word vectors directly with whatever tool you like, and then convert the vectors into a spaCy model. The spacy init-model command has a --vectors-loc argument that takes a word vectors file produced by most of the standard tools.
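As a sketch, the conversion could look like this with spaCy v2 (the language code and file paths here are placeholders for your own data):

```shell
# Train vectors with any external tool, producing a plain-text vectors file
# (one token per line, followed by its vector components), then build a
# loadable spaCy model directory from it:
python -m spacy init-model en ./my_model --vectors-loc ./vectors.txt
```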