Prodigy NER train recipe getting killed for no apparent reason

Hello everyone,

I am trying to train an NER model in Prodigy, for which I have 6 datasets available (there could be more later), obtained through ner.manual. Some details about this data:

  • Each file has 1000 samples.
  • The text in each sample is "lengthy": ~3500 characters (~450 words) on average (I know that shorter texts would be better, but for my application, I need them to remain as long as they currently are).
  • 4 labels are being recognized.

Then I use the train recipe to start training, but it suddenly stops with the following (rather uninformative) output:

---  ------  ------------  --------  ------  ------  ------  ------
Killed

BTW, when I reduce the training data to 5 datasets or fewer, training runs normally. I was guessing at some memory issue, and this post seems to confirm it; however, that post does not explain clearly what to do to diagnose and confirm such a problem (Ines' suggestion only extends a list, while Guillaume briefly mentions the psutil library to confirm a memory issue, without showing any code snippet).

What should I do?

It's hard to say for sure, but given that training with one dataset fewer works fine, I'm indeed guessing it's a memory issue.

Are you able to export the datasets to the .spacy format via the data-to-spacy recipe? If so, we might be able to pick it up from spaCy.
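In the meantime, to check whether you're actually running short of memory, here's a minimal sketch using the psutil library mentioned in that other thread (assuming it's installed via pip install psutil; the function and label names are just examples):

```python
# Minimal sketch: log memory headroom around a memory-hungry step.
# Assumes `pip install psutil`; adapt the labels to your own workflow.
import psutil

def log_memory(label: str) -> None:
    """Print this process's resident memory and the system's available memory."""
    rss_gb = psutil.Process().memory_info().rss / 1024**3
    vm = psutil.virtual_memory()
    print(f"{label}: process uses {rss_gb:.2f} GB, "
          f"{vm.available / 1024**3:.2f} GB available ({vm.percent}% of RAM in use)")

log_memory("before loading corpus")
# ... load your annotations / run a training step here ...
log_memory("after loading corpus")
```

If the available memory drops toward zero right before the process dies with a bare Killed, the Linux OOM killer is the usual culprit; on most systems dmesg will then show a corresponding "Out of memory" entry for the process.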

Hello @koaning ,

I realized a couple of things:

  1. data-to-spacy managed to build the .spacy files required for training in spaCy.
  2. When training the model, however, I ran into that Killed message again.

Having the .spacy files already generated, however, I decided to move to another cloud-based VM, this time with more memory... and the training completed successfully. That indirectly confirms the root cause of my issue.

Still, it would be awesome to have some updated code snippet to diagnose this problem (i.e., a snippet which can tell you whether you are actually running short of memory for your training dataset[s]), and some suggestions to avoid this problem with "big" training datasets.

Thank you.

When you're training locally, you can pass a custom config.cfg file to train a spaCy model. It has a few parameters that might be worth exploring further. For example, it allows you to pick smaller weights, which could help, but this setting might be the most useful:

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

It could be that the batch size is too big for your machine, so you could set it to be much smaller.
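For example, a more conservative batcher section in your config.cfg could look like this (the section and function names follow spaCy's default config; the exact numbers below are just a guess for you to experiment with):

```ini
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 10
stop = 100
compound = 1.001
```

With spacy.batch_by_words.v1, the size schedule is measured in words per batch: training starts at start words and compounds up to stop, so lowering stop is the quickest way to cap peak memory during training.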

Hello @koaning ,

Thank you, I'll keep an eye on it. In general, is there any section in the documentation where all the parameters in the config.cfg file are detailed / explained? (e.g., in your code segment, which variable controls the batch size?). The only pieces of information I have found so far are this, this and, more informally, this, but even though they're nicely documented for aspects closely related to the spaCy architecture, they miss some other aspects more related to the modeling itself.

It would be awesome to know if I am missing some other section in the documentation that could add more info regarding that.

Best regards.

I understand where you're coming from. The config.cfg can be a bit intimidating, just because there are so many settings in ML models these days.

I usually rely on the Model Architectures section of the spaCy docs to understand the hyperparameters a bit better. There are some ideas for better educational content in this domain, but for now that part of the docs is the best reference for understanding all the settings.


Hi @dave-espinosa

I am facing a similar memory problem while creating a model with NER train.

My database with annotated data is relatively small, 200 MB. I have never used a VM for computing before.

Can you share which cloud-based VM you used and how you set up such an environment?

If you can share your experience, that would be great help.


Hello @rahul1 ,

Allow me to answer your questions:

My company uses Google Cloud Platform products; specifically for VMs, we usually use either Compute Engine (my current choice for Prodigy) or Vertex AI Workbench. I think Google grants new users USD 300 in credits, which in my own experience is enough to run ~3 months' worth of experiments (important to say that Google charges by hourly rate, so my estimate might fluctuate greatly depending on the intensity of your own experiments).

I used Prodigy's official documentation for the installation and setup.

Hope it helps, and sorry about the delay!

Hi @dave-espinosa
Thank you very much for the information.
I will go for this.


Hi @dave-espinosa,

It worked! Thanks for the help. I used Compute Engine with 64 GB RAM and an Ubuntu boot disk. Inside the VM instance over SSH, installing Prodigy is similar to doing it on any local Ubuntu laptop.

gr. Rahul