Reusing Base Model Parts to Save Space Across Multiple Classifiers

Eliotfrazier · July 20, 2020, 10:56am

Hi,

I was wondering whether I could get some advice on saving memory and disk space by reusing model parts across multiple related text classifiers.

I recently trained a set of models in the same domain (trying to classify unusual employment statuses from job titles), each using en_core_web_sm. These models were intended to semantically discriminate out false positives from a sub-string extraction. I want to keep these models separate, rather than turn them into a multi-classifier, as they've all been trained on a narrow pre-filtered selection of input data points, and therefore perform badly when this is widened.

However, it would be amazing if the four output models I've trained could reuse common pieces (the parser, the tagger, etc.) through the same spaCy loader, as this would reduce the memory and storage demands by ~4x. I was previously advised that this could be done by passing the first output model as the spacy_model argument when training the subsequent models, as below, but this doesn't seem to have made the model directories any smaller.

python -m prodigy train textcat grads_final en_core_web_sm --output grad_model --eval-split 0.15
python -m prodigy train textcat intern_final grad_model --output intern_model --eval-split 0.15
python -m prodigy train textcat contractor_final grad_model --output contractor_model --eval-split 0.
python -m prodigy train textcat trainee_final grad_model --output trainee_model --eval-split 0.15

Any suggestions appreciated.

ines · July 21, 2020, 8:58am

How large is your model and what takes up the most space? The en_core_web_sm model and its components should be very small – the only thing that usually makes a difference in terms of size are word vectors.
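One quick way to check what takes up the most space is to sum the file sizes under each top-level entry of a saved model directory. The following is only a minimal sketch: the helper name `component_sizes` is hypothetical, and it assumes the model was saved as a directory with one subfolder per pipeline component.

```python
from pathlib import Path


def component_sizes(model_dir):
    """Return {entry name: total size in bytes} for each top-level entry in model_dir."""
    sizes = {}
    for entry in Path(model_dir).iterdir():
        if entry.is_dir():
            # Sum every file below this component folder, including nested folders.
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            # Top-level files (for example a metadata file) are reported as-is.
            sizes[entry.name] = entry.stat().st_size
    return sizes


# Example usage (assumes a model directory named "grad_model" exists):
# for name, size in sorted(component_sizes("grad_model").items(), key=lambda kv: -kv[1]):
#     print(f"{name}: {size / 1e6:.1f} MB")
```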

The approach here will use the base model and add a text classification component to it, or update the text classifier if it's already available. So I'm not 100% sure that workflow does what you want? Because you're essentially updating the same classifier multiple times with different data.

Eliotfrazier · July 22, 2020, 9:45am

Yes, they are all fairly small, each about 18.5 MB. For each of the models, the parser, ner, and tagger combined take up ~11.5 MB and, unless I'm mistaken, are identical. I guess ultimately it would be nice if I were able to reuse these parts across the four models.
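Whether those components really are identical can be checked by hashing the files in each component folder and comparing the digests across the model directories. This is a rough sketch rather than a definitive check: the function names `dir_digest` and `identical_components` are hypothetical, and it assumes each model directory contains subfolders named `parser`, `ner` and `tagger`.

```python
import hashlib
from pathlib import Path


def dir_digest(path):
    """Hash the relative file names and contents under path into one SHA-256 hex digest."""
    root = Path(path)
    h = hashlib.sha256()
    # Sort so the digest does not depend on filesystem listing order.
    for f in sorted(p for p in root.rglob("*") if p.is_file()):
        h.update(f.relative_to(root).as_posix().encode("utf-8"))
        h.update(f.read_bytes())
    return h.hexdigest()


def identical_components(model_dirs, components=("parser", "ner", "tagger")):
    """Map each component name to True if its folder is byte-identical in every model dir."""
    result = {}
    for name in components:
        digests = {dir_digest(Path(m) / name) for m in model_dirs}
        result[name] = len(digests) == 1
    return result


# Example usage (assumes these four model directories exist):
# print(identical_components(
#     ["grad_model", "intern_model", "contractor_model", "trainee_model"]
# ))
```

If every component comes back as `True`, the files on disk are byte-for-byte the same, which would confirm that only the text classifier differs between the four directories.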
