I am about to train an NER model in Prodigy, for which I have 6 datasets available (there could be more later), all obtained through ner.manual. Some aspects of this data:
Each file has 1000 samples.
The text in each sample is "lengthy": ~3500 characters or ~450 words on average (I know that shorter texts would be better, but for my application, they need to remain this long).
4 labels are being recognized.
Then I use the train command to start training, which suddenly stops with the following (not very informative) message:
E # LOSS TOK2VEC LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ ------------ -------- ------ ------ ------ ------
Killed
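For completeness, the command was roughly of this shape (the dataset names are just placeholders for my six ner.manual sets):

prodigy train ./output_dir --ner dataset1,dataset2,dataset3,dataset4,dataset5,dataset6 --eval-split 0.2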
BTW, when I reduce the training data to 5 datasets or fewer, training proceeds normally. I was guessing it was some memory issue, and this post seems to confirm it; however, that post does not clearly explain what to do to diagnose and confirm such a problem (Ines' suggestion only extends a list, while Guillaume briefly mentions the psutil library to confirm a memory issue, without showing any code snippet).
data-to-spacy managed to build the .spacy files required for training in spaCy.
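Roughly, the export and the subsequent spaCy training looked like this (paths and dataset names are placeholders):

prodigy data-to-spacy ./corpus --ner dataset1,dataset2,dataset3,dataset4,dataset5,dataset6 --eval-split 0.2
python -m spacy train ./corpus/config.cfg --output ./model --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy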
When training the model, however, I ran into that Killed message again.
Having already generated the .spacy files, however, I decided to move to another cloud-based VM, this time with more memory... and the training completed successfully. That indirectly confirms the root cause of my issue.
Still, it would be awesome to have some updated code snippet to diagnose this problem (i.e., a snippet that can tell you whether you are actually running short of memory for your training dataset[s]), and some suggestions for avoiding this problem with "big" training datasets.
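Something along these lines is what I have in mind, just as a rough sketch (the expansion factor below is an arbitrary guess on my part, not a measured value):

# Compare available memory against a rough estimate of what the training
# data will need once loaded, and report how much the current process
# is already using.
import os
import psutil

EXPANSION_FACTOR = 4  # guess: in-memory size vs. on-disk size of the corpus

def memory_check(paths):
    data_bytes = sum(os.path.getsize(p) for p in paths)
    available = psutil.virtual_memory().available
    rss = psutil.Process().memory_info().rss
    estimate = data_bytes * EXPANSION_FACTOR
    print(f"available memory: {available / 1e9:.2f} GB")
    print(f"current process RSS: {rss / 1e9:.2f} GB")
    print(f"rough estimate needed for the corpus: {estimate / 1e9:.2f} GB")
    return available > estimate

# e.g. memory_check(["corpus/train.spacy", "corpus/dev.spacy"])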
When you're training locally, you can pass a custom config.cfg file to train a spaCy model. This file has a few parameters that might be worth exploring further. It allows you to pick smaller weights, which could help, but one setting in particular might be the most useful.
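As a sketch, here is an excerpt with the sections I would look at (the values are only illustrative, not a recommendation). For long texts like yours, max_length under [corpora.train] is probably the most relevant one, since documents longer than that are split into sentences during training, where sentence boundaries are available:

[nlp]
# number of texts buffered when processing; not the training batch size
batch_size = 1000

[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
# smaller width/depth means smaller weights and a smaller memory footprint
width = 96
depth = 4
window_size = 1
maxout_pieces = 3

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
# documents longer than this (in words) are split during training;
# 0 disables the limit
max_length = 500

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2

[training.batcher.size]
# compounding schedule for how many words end up in each training batch
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001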
Thank you, I'll keep an eye on it. In general, is there any part of the documentation where all the parameters in the config.cfg file are detailed / explained? (e.g., in your code segment, which variable controls the batch size?). The only pieces of information I have found so far are this, this and, more informally, this, but even though they are nicely documented for aspects closely related to the spaCy architecture, they miss some other aspects more related to the modeling itself.
It would be awesome to know if I am missing some other section in the documentation that could add more info on that.
I understand where you're coming from. The config.cfg can be a bit intimidating, just because there are so many settings in ML models these days.
I usually rely on the Model Architectures section of the spaCy docs to understand the hyperparameters a bit better. There are some ideas for better educational content in this domain, but for now that part of the docs is the best reference for understanding all the settings.
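For example, for the NER component the hyperparameters sit under its registered architecture in the config, and each of them is described on that page (the values below are just what a default generated config contains, shown for illustration):

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null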
My company uses products related to Google Cloud Platform; specifically for VMs, we usually use either Compute Engine (my current choice for Prodigy) or Vertex AI Workbench. I think Google grants new users USD 300 in credits, which in my own experience is enough for roughly 3 months of experiments (it's important to note that Google charges by the hour, so my estimate can fluctuate greatly depending on the intensity of your own experiments).
It worked! Thanks for the help. I used Compute Engine with 64 GB of RAM and an Ubuntu boot disk. Inside the VM instance over SSH, installing Prodigy is the same as on any local Ubuntu laptop.
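For anyone repeating this, the setup inside the VM was essentially the standard one (the XXXX key below is a placeholder for the personal license key):

sudo apt update && sudo apt install -y python3-pip python3-venv
python3 -m venv prodigy-env && source prodigy-env/bin/activate
pip install prodigy -f https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy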