Terms Trains Crashing

We had a corpus (word2vec, 3GB) from PubMed-and-PMC focused on biomedical information. I wanted to take this generic biomedical corpus and build a model focused on radiology reports. The radiology reports amount to 4GB of data with 3 million unique entries. I was using the following command to build a word2vec model. I have a 48-core machine with 200GB of RAM. The command runs for 4 hours, consumes most of the cores, and then runs out of memory. I am not sure whether there is a memory leak or whether it simply needs more resources. Is there a way we can debug this or dump any logs to identify the issue?

nohup python -m prodigy terms.train-vectors /home/ubuntu/cnn-annotation/model/radiologymodel /home/ubuntu/cnn-annotation/InstallPackages/source/allReporttext.txt --spacy-model /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin --size 300 --merge-nps --merge-ents &

It’s definitely a memory leak — your resources should be fine for that workload.

You can enable logging with export PRODIGY_LOGGING=basic or export PRODIGY_LOGGING=verbose.

The source for the recipes is available in `prodigy/recipes/`. You can either unzip the .whl file or find it in your installation, so you can also add print statements.

The question is whether it runs out of memory during the spaCy pre-processing, or whether it runs out of memory when it passes the text to Gensim to train the vectors. I would expect Gensim to easily be able to handle 4GB of text. At the same time, what we’re accumulating from spaCy is just the text objects, so the working set there shouldn’t be so large either.

Try running with a sample of your text and watching the memory usage at each stage. The problem’s likely to be non-linear in the amount of text, so small samples might not reflect the problem — but maybe they do.
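As a rough illustration of watching memory at each stage, here is a minimal sketch using the standard library's tracemalloc module. The `report` helper and the toy data are hypothetical stand-ins for the real pre-processing and training steps, and tracemalloc may not capture allocations made inside C extensions, so it is worth watching the process in top as well.

```python
import tracemalloc


def report(stage):
    """Print and return current and peak traced memory for a named stage."""
    current, peak = tracemalloc.get_traced_memory()
    print('%s: current=%.1f MiB, peak=%.1f MiB'
          % (stage, current / 2.0**20, peak / 2.0**20))
    return current, peak


tracemalloc.start()

# Stage 1: stand-in for the spaCy pre-processing step (toy data, not real reports).
texts = ['report number %d with some findings' % i for i in range(10000)]
sentences = [text.split() for text in texts]
stage1_current, stage1_peak = report('after pre-processing')

# Stage 2: stand-in for handing the sentences to the vector trainer.
vocab = {word for sent in sentences for word in sent}
stage2_current, stage2_peak = report('after building vocab')

tracemalloc.stop()
```

Calling the helper after every few thousand documents, rather than once per stage, should show whether usage keeps climbing or levels off.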

By the way, does your spaCy model have an entity recogniser and parser loaded? If not, the merge-nps and merge-ents options won’t work.

Last night I ran it without merge-nps and merge-ents, and it still ran out of memory.

I will try to do this and report back.

Great, thanks. For ease of reference, you can always get the path to the file being run with python -c "import prodigy.recipes.terms; print(prodigy.recipes.terms.__file__)"

In terms.py, the run never reaches print("Extracted {} sentences".format(len(sentences))). My guess is that the for loop where the sentences are appended is where the memory leak occurs and causes the crash.

print("Generating stream now ...", loader)
stream = get_stream(source, loader=loader, input_key='text')
print("Finished generating stream now ...")
sentences = []
for doc in nlp.pipe((eg['text'] for eg in stream)):
    for sent in doc.sents:
        sentences.append([w.text for w in doc])
print("Extracted {} sentences".format(len(sentences)))

What if you write out the sentences to a file, instead of accumulating the list? So something like

with open('/tmp/output.txt', 'w') as file_:
    for doc in nlp.pipe((eg['text'] for eg in stream)):
        for sent in doc.sents:
            file_.write(sent.text.replace('\n', ' ') + '\n')
with open('/tmp/output.txt') as file_:
    sentences = file_.read().split('\n')

If we don’t get to read the sentences back in, there’s definitely a memory leak in spaCy.
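If the file does get written, it may also be possible to avoid loading it back into one big list. Gensim's Word2Vec accepts any restartable iterable of token lists, so a sketch like the following could stream sentences from disk. The `SentenceFile` class is hypothetical, and it assumes the file was written with tokens separated by single spaces, which is not what the snippet above does, since that writes `sent.text`.

```python
import os
import tempfile


class SentenceFile(object):
    """Restartable iterable yielding one list of tokens per non-empty line."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with open(self.path) as file_:
            for line in file_:
                tokens = line.split()
                if tokens:
                    yield tokens


# Toy demonstration with a temporary file instead of /tmp/output.txt.
tmp_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with open(tmp_path, 'w') as file_:
    file_.write('no acute findings\n\nheart size is normal\n')

sentences = SentenceFile(tmp_path)
first_pass = list(sentences)
second_pass = list(sentences)
```

The iterable has to be restartable because the trainer makes more than one pass over the corpus, which is why this is a class with `__iter__` rather than a plain generator.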

Let me try this and report back in the next 4 hours :)

If there is a memory leak, is there any workaround? Could I build the word2vec model using Gensim and then convert it to a spaCy model? Would that work?

You can definitely build the word2vec in Gensim, FastText, GloVe etc, and then load into spaCy afterwards.

The thing is, that’s actually what this recipe is doing: spaCy doesn’t learn the vectors, we’re just pre-processing the text here. And we do want to use spaCy’s text pre-processing, because we want to make sure we’re loading back terms that have been segmented the same way.

For instance, if spaCy segments the string don't as ["do", "n't"], we want to learn vectors for those two tokens. If the word vector package has preprocessed the text to produce ["don't"], that will be the key in the vector table, so we won’t get the right vectors back out when we try to associate text to vectors.

This is how the merging works, by the way — we’re just pre-processing the text, to produce different tokens. Then those tokens are passed forward to Gensim’s word2vec. Gensim tries to learn a word vector for each word type. If we tell it “anterior cruciate ligament” is a token, it’ll learn a vector for that. Then we can load the table back in, and so long as we can apply the same pre-processing on the text to get “anterior cruciate ligament” to be a token, we can access the vector for it.
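To make the point about matching segmentation concrete, here is a toy sketch. The vector table is a plain dict with made-up values, not a real Gensim or spaCy object, and `lookup` is a hypothetical helper.

```python
# Toy vector table keyed by token text, as a trained word2vec model would be.
vectors = {
    'do': [0.1, 0.2],
    "n't": [0.3, 0.4],
    'torn': [0.5, 0.6],
    'anterior cruciate ligament': [0.7, 0.8],
}


def lookup(tokens):
    """Pair each token with its vector, or None if the key is missing."""
    return [(token, vectors.get(token)) for token in tokens]


# Same segmentation as at training time: every token has a vector.
same_segmentation = lookup(['do', "n't"])

# Different segmentation: "don't" was never a key, so nothing comes back.
different_segmentation = lookup(["don't"])

# Merged noun phrase is found only if the text is merged the same way again.
merged = lookup(['torn', 'anterior cruciate ligament'])
unmerged = lookup(['torn', 'anterior', 'cruciate', 'ligament'])
```

Here `unmerged` finds a vector for "torn" but none for the three separate words, because only the merged phrase was a token when the table was built.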

Got it! So if it is a memory leak issue, I will wait for the fix in spaCy.

I tried this out and found that /tmp/output.txt is never read back in, so it appears to be a memory leak in spaCy. Let me know how to proceed. Additionally, it ran out of memory after 319k documents out of 3 million.

Could you paste me /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin/meta.json?

Let’s also simplify the script we’re working with to make this easier to reason about. Does the memory leak still happen if you run this?

import spacy


def main():
    source_loc = '/home/ubuntu/cnn-annotation/InstallPackages/source/allReporttext.txt'
    spacy_loc = '/home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin'
    output_loc = '/tmp/sentences.txt'

    nlp = spacy.load(spacy_loc)

    with open(output_loc, 'w') as output_file:
        i = 0
        with open(source_loc) as input_file:
            # Stream the input one line at a time instead of loading it all.
            texts = (line.strip() for line in input_file)
            for doc in nlp.pipe(texts):
                for sent in doc.sents:
                    output_file.write(sent.text + '\n')
                i += 1
                if i and i % 1000 == 0:
                    print('%d lines processed' % i)

if __name__ == '__main__':
    main()

If the memory leak still occurs, does it still happen if you add nlp.disable_pipes(*nlp.pipe_names) after loading the NLP model?

The next step will be to inspect the object graph to figure out just what is sticking around. That’s a bit annoying to do, so I thought we’d start with these basic checks first.
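For reference, a first pass at that kind of inspection can be done with the standard library's gc module by counting live objects per type before and after a batch of work. This is only a sketch: `count_types`, `growth` and the `Leaky` class are hypothetical, and gc only sees objects tracked by Python's garbage collector, so memory held at the C level would not show up here.

```python
import gc
from collections import Counter


def count_types():
    """Count live gc-tracked objects by type name."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, top=5):
    """Return the types whose counts grew the most between two snapshots."""
    diff = after - before  # Counter subtraction keeps only positive counts
    return diff.most_common(top)


class Leaky(object):
    """Stand-in for whatever object is being kept alive by mistake."""


before = count_types()
leaked = [Leaky() for _ in range(1000)]  # simulate objects that stick around
after = count_types()

top_growth = growth(before, after)
print(top_growth)
```

In the real script the two snapshots would be taken a few thousand documents apart inside the nlp.pipe loop, to see which type keeps growing.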

"license":"CC BY-SA 3.0",
"author":"Explosion AI",
"OntoNotes 5",
"Common Crawl"
"description":"English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities."


Surely something to do with the custom vectors. If you run without your vectors by just using spacy.load('en_core_web_sm'), can you confirm there’s no leak?

I’ll open an issue for this on the spaCy tracker, probably later today. Hopefully I’ll have a runtime workaround, but if not I can upload a development build for you to pip install.

This does not cause any memory leak. I ran it for an hour and it did not lead to a memory spike.

Thanks! Closing in on this…

Now add:

from prodigy.components import preprocess
nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')

This should be added after the call to spacy.load(). If this leaks, the problem is in spaCy’s doc.merge(). If it doesn’t leak, the problem is in Prodigy, somewhere innocuous – possibly the text loader or something.

Even without merge_entities and merge_noun_chunks we had a memory leak. Should I still try this?

As a sanity check I think it would be useful.

I wrote some simple code with a blank model, and it still gives me a memory leak.

from __future__ import unicode_literals
import plac
import numpy
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
import spacy
from prodigy.components import preprocess
from prodigy.components.loaders import get_stream

nlp = spacy.blank("en")

#doc = nlp("testing is good habit. I am doing great")

stream = get_stream("/Users/philips/Development/BigData/RS/word2vecreport/report/allReporttext.txt", loader=None, input_key='text')
sentences = []
count = 0

for doc in nlp.pipe((eg['text'] for eg in stream)):
    count = count + 1
    for sent in doc.sents:
        sentences.append([w.text for w in doc])

w2v = Word2Vec(sentences, size=300, window=5, min_count=10,
               sample=1e-5, iter=1, workers=6)

output_model = "/Users/philips/Development/BigData/RS/word2vecreport/report/pmcmodel"
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))

print('Trained Word2Vec model', output_model)