spaCy TextCat: training time increased sharply after a minor increase in training instances.

I have been using spaCy's textcat pipeline to classify a string into one of ~19,000 classes. I have been tweaking my dataset and iterating with the 'simple_cnn' architecture to achieve better results.
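For reference, a minimal sketch of this kind of setup in spaCy 2.x (the label names, texts, and loop below are placeholders for illustration, not my actual training code):

```python
import spacy

# Blank English pipeline with a textcat component using the simple_cnn
# architecture and mutually exclusive classes (spaCy 2.x API).
nlp = spacy.blank("en")
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat)

# With ~19,000 mutually exclusive classes, every label has to be registered.
for label in ["CLASS_0001", "CLASS_0002"]:  # ... up to the full label set
    textcat.add_label(label)

# Placeholder training data: (text, {"cats": {label: 0.0/1.0}}) pairs.
train_data = [
    ("example input string", {"cats": {"CLASS_0001": 1.0, "CLASS_0002": 0.0}}),
]

optimizer = nlp.begin_training()
losses = {}
for text, annotations in train_data:
    nlp.update([text], [annotations], sgd=optimizer, losses=losses)
```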

Initially I trained on 15 million training instances (with duplicates) with unbalanced classes. For this dataset the per-epoch time was ~15 hours and the training accuracy reached 86% after 8 epochs.

To improve performance further, I created a new dataset of 19 million training instances with little to no duplication and a balanced class distribution. For this dataset the per-epoch time increased to ~30 hours and the training accuracy was no more than 38% after 8 epochs.

Can you suggest a way to speed up our iteration, and any probable cause of such a drop in accuracy? We have been training on CPU; we also tried a GPU, but that was even slower than training on CPU.

Every implementation involves trade-offs, so different dataset sizes will perform relatively well or relatively poorly on it. spaCy's textcat is geared towards much smaller datasets than your problem.

For large textcat tasks like this, the best tool to use is Vowpal Wabbit: https://github.com/VowpalWabbit/vowpal_wabbit/wiki. You can probably train your model to completion with VW in a few minutes per run. This should also lead to better accuracy, since you'll be able to experiment much more extensively.
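As a rough, untested sketch of what that could look like with the vowpalwabbit Python bindings: the class count, flag values, and example strings below are placeholders to adapt, and depending on your package version the entry point may be `vowpalwabbit.Workspace` rather than `pyvw.vw`.

```python
from vowpalwabbit import pyvw

# One-against-all (--oaa) over a placeholder class count of 19,000.
# -b 28 enlarges the feature hash space and --ngram 2 adds bigram features;
# both are illustrative starting points, not tuned recommendations.
vw = pyvw.vw("--oaa 19000 --ngram 2 -b 28 --loss_function logistic --quiet")

# VW's text format is "<label> | <space-separated tokens>", labels 1..N.
train_examples = [
    "4 | this is one training string",
    "12 | another training string goes here",
]
for ex in train_examples:
    vw.learn(ex)

# Prediction on an unlabeled example returns the predicted class index.
predicted_class = vw.predict("| a new unseen string")
print(predicted_class)
vw.finish()
```

`--oaa` is the simplest multiclass reduction; with this many classes it may also be worth comparing `--ect` or `--csoaa_ldf`, and for multiple passes over the full dataset the command-line tool with `--passes` and a cache file (`-c`) is usually the more practical route.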