Prodigy NER train recipe getting killed for no apparent reason

Hello everyone,

I am about to train a NER model in Prodigy, for which I have 6 datasets available (there could be more), obtained through ner.manual. Some details about this data:

  • Each file has 1000 samples.
  • The text in each sample is "lengthy": ~3500 characters or ~450 words on average (I know that shorter texts would be better, but for my application they need to stay as long as they are now).
  • 4 labels are being recognized.

Then I use the train command to start training, which suddenly stops with the following (uninformative) message:

E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  --------  ------  ------  ------  ------
Killed

BTW, when I reduce the training data to 5 datasets or fewer, training runs normally. I was guessing at a memory issue, and this post seems to confirm it; however, that post does not clearly explain how to diagnose and confirm such a problem (Ines's suggestion only extends a list, while Guillaume briefly mentions the psutil library to confirm a memory issue, without showing any code snippet).

What should I do?

It's hard to say for sure, but given that one dataset less works fine I'm indeed guessing it's a memory issue.

Are you able to export the datasets to the .spacy format via the data-to-spacy recipe? If so, we might be able to pick it up from spaCy.

Hello @koaning ,

I got to realize a couple of things:

  1. data-to-spacy managed to build the .spacy files required for training in spaCy.
  2. When training the model, however, I ran into that Killed message again.

With the .spacy files already generated, however, I decided to move to another cloud-based VM, this time with more memory... and the training completed successfully. That indirectly confirms the root cause of my issue.

Still, it would be awesome to have an updated code snippet to diagnose this problem (i.e., a snippet that can tell whether you are actually running short of memory for your training dataset[s]), and some suggestions for avoiding this problem with "big" training datasets.

Thank you.
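
On the diagnosis side, here is a minimal sketch of the kind of memory check mentioned above. On Linux, a bare Killed with no traceback usually means the kernel's out-of-memory killer stopped the process, so watching memory while training runs is a reasonable way to confirm it. The helper name memory_snapshot is made up for this example, and psutil is an optional third-party package (pip install psutil); without it, the sketch falls back to the standard library's resource module, which is Unix-only and reports only this process's peak usage.

```python
import os
import sys

try:
    import psutil  # optional third-party package: pip install psutil
except ImportError:
    psutil = None


def memory_snapshot():
    """Return (process_mb, available_mb); available_mb is None if unknown.

    Hypothetical helper for illustration, not part of Prodigy or spaCy.
    """
    if psutil is not None:
        # Current resident memory of this process and free memory system-wide.
        rss = psutil.Process(os.getpid()).memory_info().rss
        available = psutil.virtual_memory().available
        return rss / 1e6, available / 1e6

    # Fallback without psutil: peak resident memory of this process (Unix only).
    import resource

    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1e6, None


process_mb, available_mb = memory_snapshot()
print(f"process memory: {process_mb:.0f} MB")
if available_mb is not None:
    print(f"available system memory: {available_mb:.0f} MB")
```

Since training runs as its own command-line process, one option is to run this in a second terminal every few seconds while training: if the available system memory keeps shrinking towards zero right before the Killed message, memory is very likely the culprit.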

When you're training locally, you can pass a custom config.cfg file to train a spaCy model. It has a few parameters that might be worth exploring further. For example, it lets you pick smaller weights, which could help, but this setting might be the most useful:

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001

It could be that the batch size is too big for your machine. So you can set it to be much smaller.
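
To make the three numbers in that block less abstract, here is a small pure-Python sketch of what a compounding schedule does: the batch size begins at start and is multiplied by compound after every step until it reaches stop. This is only an illustration of the behaviour, not the library's own implementation, and it assumes start <= stop. The smaller values in the second call are hypothetical and would need tuning for your machine.

```python
from itertools import islice


def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Illustrative re-implementation; assumes start <= stop.
    """
    value = float(start)
    while True:
        yield value
        value = min(value * compound, float(stop))


# Values from the config block above: the batch size grows from 100 towards 1000.
sizes = list(islice(compounding(100, 1000, 1.001), 3000))
print("first:", sizes[0], "after 1000 steps:", round(sizes[1000], 1), "last:", sizes[-1])

# A more memory-friendly variant (hypothetical numbers, not a recommendation).
small = list(islice(compounding(50, 200, 1.001), 3000))
print("first:", small[0], "last:", small[-1])
```

So start and stop are the values that bound the batch size, and compound controls how quickly it grows from one to the other. Lowering stop is the most direct way to cap how much data is processed at once.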

Hello @koaning ,

Thank you, I'll take a look at it. In general, is there any section of the documentation where all the parameters in the config.cfg file are explained in detail? (For instance, in your code segment, which variable controls the batch size?) The only pieces of information I have found so far are this, this and, more informally, this. They document the aspects closely related to spaCy's architecture nicely, but they leave out other aspects more related to the modeling itself.

It would be awesome to know if I am missing some other section of the documentation that could add more info on that.

Best regards.

I understand where you're coming from. The config.cfg can be a bit intimidating just because there are so many settings in ML models these days.

I usually rely on the Model Architectures section of the spaCy docs to understand the hyperparameters a bit better. There are some ideas for better educational content in this domain, but for now that part of the docs is the best reference for understanding all the settings.