I have been using spaCy's textcat pipeline to classify a string into one of 19,000 classes. I have been tweaking my dataset and iterating with the 'simple_cnn' architecture to achieve better results.
Initially, I trained on 15 million training instances (including duplicates) with unbalanced classes. For this dataset, each epoch took ~15 hrs and the training accuracy reached 86% after 8 epochs.
To improve performance further, I created a new dataset consisting of 19 million training instances with no/minimal duplication and a balanced class distribution. For this dataset, the per-epoch time increased to ~30 hrs and the training accuracy was no more than 38% after 8 epochs.
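To make the difference between the two datasets concrete, here is a minimal sketch of how the duplicate rate and class balance could be measured. It assumes the data is available as (text, label) pairs; `dataset_stats` and the toy examples are hypothetical names for illustration and are not part of spaCy.

```python
from collections import Counter


def dataset_stats(examples):
    """Return duplicate rate and class-balance figures for (text, label) pairs."""
    examples = list(examples)
    if not examples:
        raise ValueError("examples must not be empty")

    total = len(examples)
    # Exact duplicates: identical text with an identical label.
    unique = len(set(examples))
    label_counts = Counter(label for _, label in examples)
    majority = max(label_counts.values())
    minority = min(label_counts.values())

    return {
        "total": total,
        "duplicate_rate": 1 - unique / total,
        "num_classes": len(label_counts),
        # Accuracy obtained by always predicting the most frequent class.
        "majority_baseline": majority / total,
        # 1.0 means perfectly balanced classes.
        "imbalance_ratio": majority / minority,
    }


# Toy data (hypothetical): one exact duplicate, classes skewed 3:1.
toy = [("a", "x"), ("a", "x"), ("b", "x"), ("c", "y")]
stats = dataset_stats(toy)
print(stats)
```

If the first dataset shows a high duplicate rate or a high majority baseline, part of its 86% training accuracy may come from repeated examples and frequent classes, in which case the two accuracy figures would not be directly comparable.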
Can you suggest a way to increase our speed of iteration, and any probable cause of such a reduction in accuracy? We have been training on CPU; we also tried a GPU, but that was even slower than training on CPU.
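For reference, the per-epoch times above can be converted into rough throughput. This is a back-of-the-envelope sketch, assuming the ~15 hr and ~30 hr figures are wall-clock times for one full pass over the data; `examples_per_second` is a hypothetical helper name.

```python
def examples_per_second(num_examples, hours_per_epoch):
    """Average training throughput over one epoch."""
    return num_examples / (hours_per_epoch * 3600)


old = examples_per_second(15_000_000, 15)  # first dataset
new = examples_per_second(19_000_000, 30)  # second dataset

print(f"old: {old:.0f} examples/s")    # old: 278 examples/s
print(f"new: {new:.0f} examples/s")    # new: 176 examples/s
print(f"slowdown: {old / new:.2f}x")   # slowdown: 1.58x
```

So the second run is about 1.58x slower per example, which means the longer epoch time is not explained by the larger dataset alone.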