We have a word2vec corpus (~3GB) from PubMed-and-PMC, focused on biomedical text. I wanted to start from this generic biomedical corpus and build a model focused on radiology reports. The radiology reports amount to 4GB of data with 3 million unique entries. I was using the following command to build a word2vec model, on a 48-core machine with 200GB of RAM. It runs for about 4 hours, consumes most of the cores, and then runs out of memory. I'm not sure whether there is a memory leak or whether it simply needs more resources. Is there a way to debug this or dump logs to identify the issue?
It's definitely a memory leak; your resources should be fine for that workload.
You can enable logging with `export PRODIGY_LOGGING=basic` or `export PRODIGY_LOGGING=verbose`.
The source for the recipes is available in `prodigy/recipes/`. You can either unzip the .whl file or get it from your installation, so you can also add print statements.
The question is whether it runs out of memory during the spaCy pre-processing, or when it passes the text to Gensim to train the vectors. I would expect Gensim to easily handle 4GB of text. At the same time, what we're accumulating from spaCy is just the text objects, so the working set there shouldn't be so large either.
Try running with a sample of your text and watching the memory usage at each stage. The problem's likely to be non-linear in the amount of text, so small samples might not reflect it, but maybe they do.
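If it helps, here's a rough sketch of how you could watch memory at each stage with the standard library's `resource` module (the stage names and placement are just an assumption, not part of the recipe):

```python
# Rough sketch: print peak memory at each stage of the run.
# On Linux, ru_maxrss is reported in kilobytes.
import resource

def report(stage):
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s: peak RSS %.1f MB' % (stage, peak_kb / 1024.0))

report('startup')
# nlp = spacy.load(...)                      # load your model here
report('after loading model')
# ... run the pre-processing on a small sample ...
report('after pre-processing sample')
```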
By the way, does your spaCy model have an entity recogniser and parser loaded? If not, the merge-nps and merge-ents options won't work.
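A quick way to check (the model path here is just a placeholder for yours):

```python
import spacy

nlp = spacy.load('/path/to/your/model')  # placeholder path
print(nlp.pipe_names)  # merge-ents needs 'ner', merge-nps needs 'parser'
```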
Great, thanks. For ease of reference, you can always get the path to the file being run with `python -c "import prodigy.recipes.terms; print(prodigy.recipes.terms.__file__)"`
In terms.py, the log output never reaches `print('Extracted {} sentences'.format(len(sentences)))`. My thought is that the for loop where the sentences are appended is where the memory leak is occurring and causing the crash.
```python
print('Generating stream now ...', loader)
stream = get_stream(source, loader=loader, input_key='text')
print('Finished generating stream now ...')
sentences = []
for doc in nlp.pipe((eg['text'] for eg in stream)):
    for sent in doc.sents:
        sentences.append([w.text for w in sent])
print('Extracted {} sentences'.format(len(sentences)))
```
What if you write out the sentences to a file, instead of accumulating the list? Something like this:
```python
with open('/tmp/output.txt', 'w') as file_:
    for doc in nlp.pipe((eg['text'] for eg in stream)):
        for sent in doc.sents:
            file_.write(sent.text.replace('\n', ' ') + '\n')

with open('/tmp/output.txt') as file_:
    sentences = file_.read().split('\n')
```
If we don't get to the point of reading the sentences back in, there's definitely a memory leak in spaCy.
You can definitely build the word2vec vectors in Gensim, FastText, GloVe etc., and then load them into spaCy afterwards.
The thing is, that's actually what this recipe is doing: spaCy doesn't learn the vectors, we're just pre-processing the text here. And we do want to use spaCy's text pre-processing, because we want to make sure we're loading back terms that have been segmented the same way.
For instance, if spaCy segments the string `don't` as `["do", "n't"]`, we want to learn vectors for those two tokens. If the word vector package has preprocessed the text to produce `["don't"]`, that will be the key in the vector table, so we won't get the right vectors back out when we try to associate text with vectors.
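A quick illustration of the mismatch (just a toy example, not the recipe):

```python
import spacy

nlp = spacy.blank('en')
print([t.text for t in nlp("don't")])  # ['do', "n't"]
# A vector table keyed on "don't" has no entry for "do" or "n't",
# so lookups on spaCy's tokens would come back empty.
```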
This is how the merging works, by the way: we're just pre-processing the text to produce different tokens. Those tokens are then passed forward to Gensim's word2vec. Gensim tries to learn a word vector for each word type. If we tell it "anterior cruciate ligament" is a token, it'll learn a vector for that. Then we can load the table back in, and so long as we can apply the same pre-processing on the text to get "anterior cruciate ligament" to be a token, we can access the vector for it.
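To make that concrete, here's a minimal sketch (not the recipe itself; it assumes Gensim 4's `vector_size` argument) showing that whatever string we hand Gensim as a "word" becomes the key in the vector table:

```python
from gensim.models import Word2Vec  # assumes Gensim 4.x

# Pre-tokenised sentences: the merged noun phrase is passed to Gensim
# as a single "word", so it becomes a single key in the vector table.
sentences = [
    ['the', 'anterior cruciate ligament', 'was', 'intact'],
    ['the', 'anterior cruciate ligament', 'appears', 'torn'],
]
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(w2v.wv['anterior cruciate ligament'].shape)  # (50,)
```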
I tried this out and found that /tmp/output.txt is never read back in, so it appears to be a memory leak in spaCy. Let me know how to proceed! Additionally, it ran out of memory after 319k documents out of the 3 million.
Could you paste the contents of `/home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin/meta.json`?
Let's also simplify the script we're working with to make this easier to reason about. Does the memory leak still happen if you run this?
```python
import spacy

def main():
    source_loc = '/home/ubuntu/cnn-annotation/InstallPackages/source/allReporttext.txt'
    spacy_loc = '/home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin'
    output_loc = '/tmp/sentences.txt'
    nlp = spacy.load(spacy_loc)
    with open(output_loc, 'w') as output_file:
        i = 0
        with open(source_loc) as input_file:
            texts = (line.strip() for line in input_file)
            for doc in nlp.pipe(texts):
                for sent in doc.sents:
                    output_file.write(sent.text + '\n')
                i += 1
                if i and i % 1000 == 0:
                    print('%d lines processed' % i)

if __name__ == '__main__':
    main()
```
If the memory leak still occurs, does it still happen if you add `nlp.disable_pipes(*nlp.pipe_names)` after loading the NLP model?
The next step will be to inspect the object graph to figure out just what is sticking around. That's a bit annoying to do, so I thought we'd start with these basic checks first.
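One way to do that inspection would be the third-party objgraph package (a suggestion here, not something built into spaCy or Prodigy): call something like this every N documents inside the loop and see which object types keep growing.

```python
import objgraph  # third-party: pip install objgraph

def check_growth(i, every=10000):
    """Every `every` documents, print which object types have grown."""
    if i and i % every == 0:
        objgraph.show_growth(limit=10)
```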
It's surely something to do with the custom vectors. If you run without your vectors, just using `spacy.load('en_core_web_sm')`, can you confirm there's no leak?
I'll open an issue for this on the spaCy tracker, probably later today. Hopefully I'll have a runtime workaround, but if not I can upload a development build for you to `pip install`.
```python
from prodigy.components import preprocess

nlp.add_pipe(preprocess.merge_entities, name='merge_entities')
nlp.add_pipe(preprocess.merge_noun_chunks, name='merge_noun_chunks')
```
This should be added after the call to `spacy.load()`. If this leaks, the problem is in spaCy's `doc.merge()`. If it doesn't leak, the problem is in Prodigy, somewhere innocuous, possibly the text loader or something.
```python
# Replace the loaded model's vectors with the ones trained in Gensim.
nlp.vocab.reset_vectors(width=300)
output_model = '/Users/philips/Development/BigData/RS/word2vecreport/report/pmcmodel'
for word in w2v.wv.vocab:
    nlp.vocab.set_vector(word, w2v.wv.word_vec(word))
nlp.to_disk(output_model)  # save the model with the new vectors
```