spaCy TextCat: training time increased sharply after a minor increase in training instances.

I have been using spaCy's textcat pipeline to classify a string into one of ~19,000 classes. I have been tweaking my dataset and iterating with the 'simple_cnn' architecture to achieve better results.
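For reference, a minimal sketch of this kind of setup in spaCy 2.x (the label names, texts, and loop below are placeholders for illustration, not my actual training code):

```python
import spacy

# Blank English pipeline with a textcat component using the simple_cnn
# architecture and mutually exclusive classes (spaCy 2.x API).
nlp = spacy.blank("en")
textcat = nlp.create_pipe(
    "textcat",
    config={"exclusive_classes": True, "architecture": "simple_cnn"},
)
nlp.add_pipe(textcat)

# With ~19,000 mutually exclusive classes, every label has to be registered.
for label in ["CLASS_0001", "CLASS_0002"]:  # ... up to the full label set
    textcat.add_label(label)

# Placeholder training data: (text, {"cats": {label: 0.0/1.0}}) pairs.
train_data = [
    ("example input string", {"cats": {"CLASS_0001": 1.0, "CLASS_0002": 0.0}}),
]

optimizer = nlp.begin_training()
losses = {}
for text, annotations in train_data:
    nlp.update([text], [annotations], sgd=optimizer, losses=losses)
```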

Initially I trained on 15 million training instances (with duplicates) with unbalanced classes. For this dataset the per-epoch time was ~15 hours and the training accuracy reached 86% after 8 epochs.

To improve performance further, I created a new dataset of 19 million training instances with little to no duplication and a balanced class distribution. For this dataset the per-epoch time increased to ~30 hours and the training accuracy was no more than 38% after 8 epochs.

Can you suggest a way to speed up our iteration, and any probable cause of such a drop in accuracy? We have been training on CPU; we also tried a GPU, but that was even slower than training on CPU.

Every implementation involves trade-offs, so different dataset sizes will perform relatively well or relatively poorly on it. spaCy's textcat is geared towards much smaller datasets than your problem.

For large textcat tasks like this, the best tool to use is Vowpal Wabbit: https://github.com/VowpalWabbit/vowpal_wabbit/wiki. You can probably train your model to completion with VW in a few minutes per run. This should also lead to better accuracy, since you'll be able to experiment much more extensively.
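As a rough, untested sketch of what that could look like with the vowpalwabbit Python bindings: the class count, flag values, and example strings below are placeholders to adapt, and depending on your package version the entry point may be `vowpalwabbit.Workspace` rather than `pyvw.vw`.

```python
from vowpalwabbit import pyvw

# One-against-all (--oaa) over a placeholder class count of 19,000.
# -b 28 enlarges the feature hash space and --ngram 2 adds bigram features;
# both are illustrative starting points, not tuned recommendations.
vw = pyvw.vw("--oaa 19000 --ngram 2 -b 28 --loss_function logistic --quiet")

# VW's text format is "<label> | <space-separated tokens>", labels 1..N.
train_examples = [
    "4 | this is one training string",
    "12 | another training string goes here",
]
for ex in train_examples:
    vw.learn(ex)

# Prediction on an unlabeled example returns the predicted class index.
predicted_class = vw.predict("| a new unseen string")
print(predicted_class)
vw.finish()
```

`--oaa` is the simplest multiclass reduction; with this many classes it may also be worth comparing `--ect` or `--csoaa_ldf`, and for multiple passes over the full dataset the command-line tool with `--passes` and a cache file (`-c`) is usually the more practical route.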