We have a 3 GB word2vec model from PubMed-and-PMC that covers general biomedical text. I wanted to start from this generic biomedical model and build a model focused on radiology reports. The radiology reports amount to 4 GB of data with 3 million unique entries. I was using the following command to build the word2vec model on a 48-core machine with 200 GB of RAM. It runs for about 4 hours, consumes most of the cores, and then runs out of memory. I'm not sure whether there is a memory leak or whether it simply needs more resources. Is there a way to debug this, or any logs I can dump to identify the issue?
nohup python -m prodigy terms.train-vectors /home/ubuntu/cnn-annotation/model/radiologymodel /home/ubuntu/cnn-annotation/InstallPackages/source/allReporttext.txt --spacy-model /home/ubuntu/cnn-annotation/InstallPackages/model/pmcmodel/PubMed-and-PMC-w2v-spacy.bin --size 300 --merge-nps --merge-ents &
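For reference, this is roughly how I was planning to watch the memory of the training process while it runs (a minimal sketch, not part of the command above; it assumes psutil is installed and takes the PID reported by nohup/ps as an argument):

import sys
import time

import psutil  # assumed installed via: pip install psutil

# PID of the prodigy training process (e.g. from `ps aux | grep prodigy`)
pid = int(sys.argv[1])
proc = psutil.Process(pid)

# Log resident memory once a minute until the process exits
with open("train_memory.log", "w") as log:
    while proc.is_running():
        rss_gb = proc.memory_info().rss / (1024 ** 3)
        log.write(f"{time.strftime('%H:%M:%S')}\t{rss_gb:.2f} GB\n")
        log.flush()
        time.sleep(60)

Would something like this be enough to tell a leak apart from the model simply needing more RAM, or is there built-in logging I should enable instead?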