Large Datasets Google Cloud

Has anyone here got experience running any of the recipes on large datasets, or running the recipes on Google Cloud Platform?

I keep hitting out-of-memory problems when training on my local machine (MacBook Pro, 16GB RAM). I've run out of memory on a couple of occasions:

  1. running ner.manual using a model where I have loaded in my own word vectors. My word2vec model file is around 500 MB.
  2. running ner.batch-train on a dataset that contains approximately 15,000 annotations.

To get around these problems I want to shift the training off to the cloud, preferably to Google Cloud Platform. I've had some experience using ml-engine, but wondered if anyone had tips or hints for training spaCy / using Prodigy recipes at scale?

Thanks.

I’m not sure why you should be seeing out-of-memory errors in ner.manual. I can’t see what the model should be storing, so it sounds to me like something’s not right here. What version are you using?

For batch-train, we do read the whole dataset into memory, and there are more places we could run into problems. 16GB should be enough, though. Are the documents very long?

You shouldn’t need to do anything really special to run Prodigy on Google Compute Engine. We actually use this for spaCy and Prodigy all the time. I would recommend getting a machine with high memory, but not particularly many CPUs, as it won’t use multiple cores very efficiently. Remember to open port 8080 in the firewall as well, and change the host setting in your prodigy.json to 0.0.0.0 so that you can connect to the web service. Alternatively, you could use a reverse proxy. I recommend https://traefik.io
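To make the firewall and host settings above concrete, here's a sketch of the setup; the firewall-rule name `allow-prodigy` is just a placeholder, and you'd substitute your own project defaults:

```shell
# Open port 8080 in the GCE firewall so the Prodigy web app is reachable
# (rule name "allow-prodigy" is an example -- pick your own):
gcloud compute firewall-rules create allow-prodigy \
    --allow tcp:8080 \
    --direction INGRESS

# Then, in prodigy.json on the VM, bind the server to all interfaces
# instead of localhost:
# {
#   "host": "0.0.0.0",
#   "port": 8080
# }
```

With that in place you can point your browser at the VM's external IP on port 8080.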

Thanks for getting back to me.

OK, sorry, that was my bad: it's actually ner.batch-train that was running out of memory when I used my word2vec model. I should have checked before I posted; I ran into that problem about a month ago. However, I've just tried it again and it seems to be working, so sorry for the confusion.

For batch-train, we do read the whole dataset into memory, and there are more places we could run into problems. 16GB should be enough, though. Are the documents very long?

The mean document size is ~8 words, so I wouldn't call that very long. I'm using spaCy 2.0.11 and Prodigy 1.4.2. Maybe I didn't have enough free memory when I ran the batch train.

Thanks for the tips on running all of this on Google - especially around which machines to target.

Hi Mitch,

I’m running into a similar out-of-memory problem (that’s not surprising, since my laptop has only 8GB of memory, clearly not data science specs). But before buying a new PC: how do I run Prodigy on GCP? The GCP instructions suggest it’s just any VM, and you run the same series of commands as on your laptop (python -m prodigy ner.teach…). Don’t I have to copy my whole environment (additional software, data) to the cloud?

thanks,

Andreas

@mitch There’s a minor memory leak in ner.batch-train that will be fixed in the upcoming 1.6 version. Try the current version (1.5.1) as well.

@aph61 You do need to get the data to the VM you’re using, but it acts just like any other server. It’s easiest to install the gcloud tools locally, so you can use them to manage the SSH keys.
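As a sketch of what that looks like in practice (the instance name `prodigy-vm` and zone are placeholders for your own):

```shell
# Copy a dataset up to the VM -- gcloud handles the SSH keys for you:
gcloud compute scp ./annotations.jsonl prodigy-vm:~/data/ --zone us-central1-a

# Open a shell on the VM the same way:
gcloud compute ssh prodigy-vm --zone us-central1-a
```

Once you're on the VM, you install Prodigy and your other dependencies just as you would locally.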

I like to use the gcsfuse tool, which lets you mount a Cloud Storage bucket as a local directory. This way you can copy files into it from your local machine, and the same files will be available in a directory on your VM. Sort of like Dropbox.
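A minimal sketch of the gcsfuse workflow, assuming a bucket named `my-prodigy-data` (substitute your own):

```shell
# On the VM: mount the bucket as a local directory
mkdir -p ~/bucket
gcsfuse my-prodigy-data ~/bucket

# On your laptop: copy files into the bucket with gsutil...
# gsutil cp annotations.jsonl gs://my-prodigy-data/
# ...and they show up under ~/bucket on the VM.

# Unmount when you're done:
fusermount -u ~/bucket
```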

Thanks, I’ll give it a try…

best,

Andreas