terms.train-vectors reads the entire dataset into memory

I was reviewing the terms.train-vectors recipe and noticed that it reads the entire dataset into memory before calling Word2Vec:

    for doc in nlp.pipe((eg['text'] for eg in stream)):
        for sent in doc.sents:
            sentences.append([w.text for w in sent])

    print("Extracted {} sentences".format(len(sentences)))
    w2v = Word2Vec(sentences, size=size, window=window, min_count=min_count,
                   sample=1e-5, iter=n_iter, workers=n_workers,
                   negative=negative)

This is extremely inefficient and probably accounts for the out-of-memory errors others have been seeing.

I haven’t adapted it to Prodigy yet, but I found some of my old gensim word2vec code, which I’ve attached below until I have a chance to fix the Prodigy recipe:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import models
from gensim.utils import simple_preprocess
from gensim.summarization.textcleaner import get_sentences
import glob
import pandas as pd

class MakeIter(object):
    """Wrap a generator function so the corpus can be iterated over multiple times."""
    def __init__(self, generator_func, *args, **kwargs):
        self.generator_func = generator_func
        self.args = args
        self.kwargs = kwargs
    def __iter__(self):
        # Each pass gets a fresh generator, so Word2Vec can re-read the corpus.
        return self.generator_func(*self.args, **self.kwargs)

data_dir = 'data_dir/'  # placeholder paths
save_dir = 'save_dir/'
docs = glob.glob(data_dir + '*.json')

def tokenize(text):
    # Split the text into sentences, then lowercase and tokenize each one.
    return [simple_preprocess(sent) for sent in get_sentences(text)]

def yield_docs(filenames):
    # Stream tokenized sentences one at a time instead of loading everything into memory.
    for fn in filenames:
        with open(fn, 'r') as f:
            df = pd.read_json(f, orient='columns')
        for note in df['TEXT']:
            for sent in tokenize(note):
                yield sent
        del df

doc_stream = MakeIter(yield_docs, docs)
word2vec = models.Word2Vec(doc_stream, workers=5)
word2vec.save(save_dir + 'word2vec.model')

@beckerfuffle Thanks!!

I could’ve sworn I tried something like this and was confused that the Word2Vec class didn’t seem to accept a generator. I forgot that you need to make an iterator class like this so that it can make multiple passes. I’ll integrate this.
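
For anyone following along, here’s a minimal illustration of the difference (the names are just for the example, not from the recipe): gensim iterates over the corpus once to build the vocabulary and again to train, so a plain generator is exhausted after the first pass, while an object whose `__iter__` returns a fresh generator works for every pass.

    def sentence_gen():
        yield ["first", "sentence"]
        yield ["second", "sentence"]

    gen = sentence_gen()
    print(sum(1 for _ in gen))   # 2 -- the first pass consumes the generator
    print(sum(1 for _ in gen))   # 0 -- nothing left for a second pass

    class SentenceCorpus(object):
        # Re-iterable: each __iter__ call creates a fresh generator,
        # which is what the MakeIter class above does in a generic way.
        def __iter__(self):
            return sentence_gen()

    corpus = SentenceCorpus()
    print(sum(1 for _ in corpus))  # 2
    print(sum(1 for _ in corpus))  # 2 -- every pass sees the full data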


Yes, it’s a bit of a kludge to say the least, and maybe there’s a cleaner way to do it, but it worked for me.

I also worked on this a few weeks ago and just want to share my findings:

I guess you need to build the vocab first before training, and (I think a newer version of) gensim needs additional parameters. My changed code looked like this:

w2v = gensim.models.word2vec.Word2Vec(size=size, window=window, min_count=min_count,
                                      sample=1e-5, sg=sg, workers=n_workers, iter=n_iter)

# First pass over the sentence generator: build the vocabulary.
w2v.build_vocab(get_sents())
# Second pass: train on a fresh generator.
w2v.train(get_sents(), total_examples=w2v.corpus_count, epochs=w2v.iter)

The downside is that it takes two passes through the generator, which can be slower if the preprocessing is heavy, but the memory usage will of course be much better.
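
In case it helps, here’s a rough sketch of what `get_sents()` could look like for this two-pass approach. The JSONL path, the `"text"` field, and the spaCy model name are all assumptions for illustration, not part of the original recipe:

    # Sketch only: a possible get_sents() for the build_vocab()/train() pattern above.
    import json
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumed model; needs a parser for doc.sents
    source = "texts.jsonl"              # hypothetical input file, one JSON object per line

    def get_sents():
        def texts():
            with open(source, "r", encoding="utf8") as f:
                for line in f:
                    yield json.loads(line)["text"]
        # Each call returns a fresh generator, so build_vocab() and train()
        # can both consume the stream from the start.
        for doc in nlp.pipe(texts()):
            for sent in doc.sents:
                yield [w.text for w in sent]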

Just released v1.4.2, which includes the fix mentioned above!
