terms.train-vectors Crashing

Also, what I don't understand is that we are iterating through the sentences but never using the sent object. This might be the root cause of the issue.

For example, I had a sample file with the text below:
testing the sentences. there might be a memory leak. This will repeat three times for each sentences.

The sentences array contained the same tokens repeated three times, once for each of the three sentences:

[
['testing', 'the', 'sentences', '.', 'there', 'might', 'be', 'a', 'memory', 'leak', '.', 'This', 'will', 'repeat', 'three', 'times', 'for', 'each', 'sentences', '.'],
['testing', 'the', 'sentences', '.', 'there', 'might', 'be', 'a', 'memory', 'leak', '.', 'This', 'will', 'repeat', 'three', 'times', 'for', 'each', 'sentences', '.'],
['testing', 'the', 'sentences', '.', 'there', 'might', 'be', 'a', 'memory', 'leak', '.', 'This', 'will', 'repeat', 'three', 'times', 'for', 'each', 'sentences', '.']
]

This might be the root cause of the memory issue.

That’s definitely a bug! It should be [w.text for w in sent]!
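A minimal sketch of the suspected bug versus the fix (variable names are assumptions based on the snippets in this thread; spaCy 2.x API assumed):

import spacy

# Tiny pipeline with only a sentencizer, as in the snippets later in this thread.
nlp = spacy.blank('en')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp("testing the sentences. there might be a memory leak. "
          "This will repeat three times for each sentences.")

# Buggy version: iterates over doc.sents but collects tokens from the whole
# doc, so every sentence appends a copy of the full document's token list.
sentences = []
for sent in doc.sents:
    sentences.append([w.text for w in doc])

# Fixed version: collects only the tokens of the current sentence.
sentences = []
for sent in doc.sents:
    sentences.append([w.text for w in sent])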

I don't understand why my linter didn't pick that up. We usually run frosted, which identifies unused variables.

You can’t use two sets of pre-trained word vectors in the same model. If you have vectors from your PMC model, you can use those — or, alternatively, you can train new vectors which you would use instead.

This applies to both spaCy and Gensim: you can only use the two sets of vectors separately, not together, because they don't map the words into a consistent shared space. It's not a software limitation; that's just how these things work.
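As a rough illustration (spaCy 2.x and Gensim 3.x APIs assumed; in Gensim 4.x the parameter size is vector_size and index2word is index_to_key), training one consistent set of vectors and copying it into a spaCy vocab could look like this:

import spacy
from gensim.models import Word2Vec

# Toy corpus; in practice this would be your PMC sentences.
sentences = [['testing', 'the', 'sentences'],
             ['there', 'might', 'be', 'a', 'memory', 'leak']]
w2v = Word2Vec(sentences, size=64, min_count=1)

# Copy the single, consistent vector table into a fresh spaCy vocab.
nlp = spacy.blank('en')
for word in w2v.wv.index2word:
    nlp.vocab.set_vector(word, w2v.wv[word])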

Another OOM issue. The corpus is around 4 million documents, about 27GB in total. The machine has 60GB of RAM and 16 cores. I am unable to proceed further. Is there any workaround?

File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/thinc/api.py", line 55, in predict
    X = layer(X)
File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/thinc/neural/_classes/model.py", line 161, in __call__
    return self.predict(x)
File "/home/madhujahagirdar/bionlp-gpu/venv/lib/python3.5/site-packages/thinc/neural/_classes/convolution.py", line 25, in predict
    return self.ops.seq2col(X, self.nW)
File "ops.pyx", line 462, in thinc.neural.ops.NumpyOps.seq2col
File "cymem/cymem.pyx", line 42, in cymem.cymem.Pool.alloc (cymem/cymem.cpp:1091)
MemoryError: Error assigning 18446744072542178688 bytes
3166000

I have split the documents into a maximum of 100 lines each now, but I am still running into the OOM issue.

@ines, @honnibal: spaCy and Prodigy are both awesome tools and I would love to use them. With the issues I am having training and using them in production, I am unable to proceed further due to OOM errors. I would love some more support. Let me know if there are other ways to get support on this (an enterprise license, etc.).

@madhujahagirdar That error is coming from the convolutional layers, trying to assign a really enormous block of memory. Are you still running terms.train-vectors? If you’ve emptied the pipeline there should be no reason you’d be calling into that function.

More importantly, why is the allocation so large? What’s the maximum length of the documents you’re working with, in number of words?
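One quick way to check (a hypothetical helper; adjust it to however your stream is loaded) is to scan the corpus for its longest document, since the seq2col allocation presumably grows with document length and a single outlier could trigger a huge request:

import json

def max_doc_length(jsonl_path):
    # Rough whitespace token count per document; good enough to spot outliers.
    longest = 0
    with open(jsonl_path) as f:
        for line in f:
            text = json.loads(line)['text']
            longest = max(longest, len(text.split()))
    return longest

print(max_doc_length('corpus.jsonl'))  # hypothetical path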

Unfortunately we can’t offer a support contract, as we need to prioritise any bugs or feature requests based on their impact, no matter who reports them.

I have around 4 million documents, each around 100 lines (roughly 8-10KB), and the total corpus size is 27GB. I am trying to create a word2vec model using spaCy and terms.train-vectors. After processing 300k documents it consumes all 60GB and runs out of memory.

I have disabled all the pipes except the sentencizer (in terms.py):

nlp = spacy.load(spacy_model)
nlp.disable_pipes(*nlp.pipe_names)
nlp.add_pipe(nlp.create_pipe('sentencizer'))

and also reduced the batch size

for doc in nlp.pipe((eg['text'] for eg in stream), batch_size=25, n_threads=-1):

and fixed the loop which was causing the potential memory leak (referred to above):
for sent in doc.sents:
sentences.append([w.text for w in sent])

Even with all of these changes, after 300k documents it still consumes all 60GB and hits OOM.

The error above indicates that the pipeline is somehow not disabled — the convolutional layers are being applied, which means that there’s some model being used.

Your corpus is quite large. Do you run out of memory if you just train the word vectors with Gensim directly? RaRe, the makers of Gensim, do offer support — you could contact them here: https://rare-technologies.com/contact/

The error above is from before disabling the pipeline, not after. Now the kernel kills the Python process once it consumes all the memory.

I can train word2vec using Gensim without any memory issues. However, once I build the model in Gensim, I have another issue in batch-train (I have reported this as well), which also crashes with OOM if I use the model from Gensim: https://support.prodi.gy/t/batch-train-buffer-full/373/12

It's a PubMed corpus and the data is publicly available. If you can give me a repo, I can upload the corpus.

I'm confused: so if you do disable the pipeline, does the training work?

Also, can you try just training with Gensim, but setting the dimensionality quite small (e.g. 64 dimensions) and a high frequency threshold? Then you should be able to get a smaller vector model so you can keep working. It shouldn't affect accuracy that much to have a word vector model that's around 300MB.
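A minimal sketch of that suggestion, assuming Gensim 3.x parameter names (size became vector_size in Gensim 4.x) and a hypothetical one-sentence-per-line corpus file:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus from disk instead of holding it all in memory.
model = Word2Vec(
    LineSentence('pubmed_sentences.txt'),  # hypothetical path
    size=64,          # small dimensionality, as suggested above
    min_count=100,    # high frequency threshold drops rare words
    workers=16,
)
model.wv.save_word2vec_format('pubmed_vectors_64d.txt')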

OK, I will switch to Gensim for word2vec, but after I build word2vec with Gensim I cannot use Prodigy to do text classification, due to another OOM issue with batch-train: https://support.prodi.gy/t/batch-train-buffer-full/373/12

You can still use Prodigy for text classification. If you use a smaller vector model, the serialization issue won't occur, so the problem you were originally trying to work around goes away. If necessary, you can simply not use the word vectors at all: they'll improve accuracy, but it's still definitely possible to train without them.
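If you do go without vectors, one option (a sketch; it assumes the Prodigy text classification recipes will accept any spaCy model path here) is to save a blank, vector-free pipeline and point the recipe at that instead of the large vector model:

import spacy

# A blank English pipeline: no word vectors, no pretrained components.
nlp = spacy.blank('en')
nlp.to_disk('/tmp/blank_en_model')  # hypothetical output path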