terms.train-vectors reads the entire dataset into memory

I was reviewing the terms.train-vectors recipe and noticed that it reads the entire dataset into memory before calling Word2Vec:

    for doc in nlp.pipe((eg['text'] for eg in stream)):
        for sent in doc.sents:
            sentences.append([w.text for w in sent])

    print("Extracted {} sentences".format(len(sentences)))
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)

This is extremely inefficient and probably accounts for the out-of-memory errors others have been seeing.

I haven’t adapted it to Prodigy yet, but I found some of my old gensim word2vec code, which I’ve attached below until I have a chance to fix the Prodigy recipe:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import models
from gensim.utils import simple_preprocess
from gensim.summarization.textcleaner import get_sentences
import glob
import pandas as pd

class MakeIter(object):
    """Wrap a generator function so the corpus can be iterated over multiple times."""
    def __init__(self, generator_func, *args, **kwargs):
        self.generator_func = generator_func
        self.args = args
        self.kwargs = kwargs
    def __iter__(self):
        # Each pass gets a fresh generator, so Word2Vec can re-read the corpus.
        return self.generator_func(*self.args, **self.kwargs)

data_dir = 'data_dir/'  # placeholder paths
save_dir = 'save_dir/'
docs = glob.glob(data_dir + '*.json')

def tokenize(text):
    # Split the text into sentences, then lowercase and tokenize each one.
    return [simple_preprocess(sent) for sent in get_sentences(text)]

def yield_docs(filenames):
    # Stream tokenized sentences one at a time instead of loading everything into memory.
    for fn in filenames:
        with open(fn, 'r') as f:
            df = pd.read_json(f, orient='columns')
        for note in df['TEXT']:
            for sent in tokenize(note):
                yield sent
        del df

doc_stream = MakeIter(yield_docs, docs)
word2vec = models.Word2Vec(doc_stream, workers=5)
word2vec.save(save_dir + 'word2vec.model')

@beckerfuffle Thanks!!

I could’ve sworn I tried something like this and was confused that the Word2Vec class didn’t seem to accept a generator. I forgot that you need to make an iterator class like this so that it can make multiple passes. I’ll integrate this.
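
For anyone following along, here’s a minimal illustration of the difference (the names are just for the example, not from the recipe): gensim iterates over the corpus once to build the vocabulary and again to train, so a plain generator is exhausted after the first pass, while an object whose `__iter__` returns a fresh generator works for every pass.

    def sentence_gen():
        yield ["first", "sentence"]
        yield ["second", "sentence"]

    gen = sentence_gen()
    print(sum(1 for _ in gen))   # 2 -- the first pass consumes the generator
    print(sum(1 for _ in gen))   # 0 -- nothing left for a second pass

    class SentenceCorpus(object):
        # Re-iterable: each __iter__ call creates a fresh generator,
        # which is what the MakeIter class above does in a generic way.
        def __iter__(self):
            return sentence_gen()

    corpus = SentenceCorpus()
    print(sum(1 for _ in corpus))  # 2
    print(sum(1 for _ in corpus))  # 2 -- every pass sees the full data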


Yes, it’s a bit of a kludge to say the least, and maybe there’s a cleaner way to do it, but it worked for me.

I also worked on this a few weeks ago and just want to share my findings:

I guess you need to build the vocab first before training, and (I think a newer version of) gensim needs additional parameters. My changed code looked like this:

w2v = gensim.models.word2vec.Word2Vec(size=size, window=window, min_count=min_count,
                                      sample=1e-5, sg=sg, workers=n_workers, iter=n_iter)

# First pass over the sentence generator: build the vocabulary.
w2v.build_vocab(get_sents())
# Second pass: train on a fresh generator.
w2v.train(get_sents(), total_examples=w2v.corpus_count, epochs=w2v.iter)

The downside is that it takes two passes through the generator, which can be slower if the preprocessing is heavy, but the memory usage will of course be much better.
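
In case it helps, here’s a rough sketch of what `get_sents()` could look like for this two-pass approach. The JSONL path, the `"text"` field, and the spaCy model name are all assumptions for illustration, not part of the original recipe:

    # Sketch only: a possible get_sents() for the build_vocab()/train() pattern above.
    import json
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model; needs a parser for doc.sents
    source = "texts.jsonl"              # hypothetical input file, one JSON object per line

    def get_sents():
        def texts():
            with open(source, "r", encoding="utf8") as f:
                for line in f:
                    yield json.loads(line)["text"]
        # Each call returns a fresh generator, so build_vocab() and train()
        # can both consume the stream from the start.
        for doc in nlp.pipe(texts()):
            for sent in doc.sents:
                yield [w.text for w in sent]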

Just released v1.4.2, which includes the fix mentioned above!
