I've spent an enormous amount of time in the documentation trying to find out how to speed up the training process for my labeled dataset.
```
Components: ner
Merging training and evaluation data for 1 components
- [ner] Training: 63291 | Evaluation: 15822 (20% split)
Training: 63290 | Evaluation: 15822
```
Training one epoch takes roughly two hours, and what's very strange is that it doesn't even use 15% of the available CPU.
```
prodigy train --ner nel_skills_large1 model --base-model en_core_web_md
```
I'd love any suggestions for parameters (in config.cfg) that could speed up the process (I tried "batch_size", no luck).
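For reference, the settings I was tweaking live in the [nlp] and [training.batcher] blocks of config.cfg. This is just an illustrative sketch using spaCy's default batcher, not my exact values:

```ini
[nlp]
# batch size used by nlp.pipe, e.g. during evaluation
batch_size = 1000

[training.batcher]
# spaCy's default batcher groups examples by padded word count
@batchers = "spacy.batch_by_words.v1"
size = 2000
tolerance = 0.2
discard_oversize = false
get_length = null
```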
Two hours per epoch definitely sounds pretty long. Are you sure you're not running out of memory or disk space? You could run some profiling to take a look at what's particularly slow and whether memory is an issue.
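If it helps, here's one way to do that from the command line, assuming Python 3.8+ (cProfile's -m flag was added in 3.8). Cutting the run short with a --training.max_steps override keeps the profile cheap, but check prodigy train --help to confirm your version accepts spaCy-style config overrides like that:

```
# profile a shortened training run and write stats to train.prof
python -m cProfile -o train.prof -m prodigy train --ner nel_skills_large1 model \
    --base-model en_core_web_md --training.max_steps 200

# show the 25 most expensive calls by cumulative time
python -c "import pstats; pstats.Stats('train.prof').sort_stats('cumulative').print_stats(25)"
```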
Alternatively, you could try streaming in your corpus by setting max_epochs = -1, in case it's too large to fit into memory. See the second part of this section for details: https://spacy.io/usage/training#custom-code-readers-batchers. This is slightly more involved, though: you need to handle your own shuffling and make sure all labels are initialised, since the corpus isn't available in memory and spaCy can't just process it to read off all the available labels.
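To make that concrete, here's a minimal sketch of a streaming reader, loosely following the linked docs. The reader name, the JSONL input format, and the buffered shuffling are all assumptions for illustration:

```python
# functions.py — a registered corpus reader that streams Examples
# instead of loading the whole corpus into memory
from typing import Callable, Iterable, Iterator
import random

import spacy
import srsly
from spacy.language import Language
from spacy.training import Example


@spacy.registry.readers("stream_jsonl_reader.v1")  # hypothetical name
def stream_jsonl_reader(
    path: str, shuffle_buffer: int = 1000
) -> Callable[[Language], Iterable[Example]]:
    def generate_stream(nlp: Language) -> Iterator[Example]:
        buffer = []
        # assumes one {"text": ..., "entities": [[start, end, label], ...]}
        # object per line
        for eg in srsly.read_jsonl(path):
            doc = nlp.make_doc(eg["text"])
            buffer.append(Example.from_dict(doc, {"entities": eg["entities"]}))
            # the corpus never sits fully in memory, so shuffle
            # approximately via a fixed-size buffer
            if len(buffer) >= shuffle_buffer:
                random.shuffle(buffer)
                yield from buffer
                buffer = []
        random.shuffle(buffer)
        yield from buffer

    return generate_stream
```

In config.cfg you'd then point the training corpus at the reader, set max_epochs = -1, and initialise the ner labels from a file (spacy init labels can generate one), since spaCy can't infer them from a stream:

```ini
[corpora.train]
@readers = "stream_jsonl_reader.v1"
path = "corpus/train.jsonl"

[training]
max_epochs = -1

[initialize.components.ner.labels]
@readers = "spacy.read_labels.v1"
path = "corpus/labels/ner.json"
```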