I'm trying to figure out how best to save space. We want to do periodic batch training and persist the models to S3. I've noticed that when I run textcat.batch_train, it creates a new textcat spaCy model and adds it to a textcat directory inside the saved model, next to the vocabulary directory that comes with the en_vectors_web_lg model. Since the vocabulary isn't being changed, it would be nice to omit it from the persisted model and add it back in when doing textcat.teach.
Is there a good way to mix and match model sub-components like this?
Yeah, it should be fairly easy to do what you want. You should be able to use either nlp.get_pipe('textcat').to_bytes(exclude=['vocab']) or nlp.get_pipe('textcat').to_disk(model_path, exclude=['vocab']), which saves out the textcat model without the vocab. This lets you checkpoint the model and then load it back later. The vectors data, which is the part that takes up a lot of space, is immutable, so you don't need to save it out at every checkpoint.
import spacy

# save the trained textcat component to disk, without the vocab
nlp.get_pipe('textcat').to_disk(model_path, exclude=['vocab'])

# later: create a fresh pipeline that already has the vocab (and vectors)
nlp = spacy.load(input_model)

# add the sentencizer that TextClassifier adds
sentencizer = nlp.create_pipe('sentencizer')
nlp.add_pipe(sentencizer)

# create a new textcat pipe, which shares the pipeline's vocab
textcat = nlp.create_pipe('textcat')

# load the model weights and cfg from the dumped component
textcat.from_disk(model_path, exclude=['vocab'])

# add the restored pipe to the pipeline
nlp.add_pipe(textcat)
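If you'd rather push the checkpoint straight to S3 instead of going through the disk, the to_bytes route works the same way. Here's a rough sketch, assuming boto3 for the transfer, the same exclude-list arguments as above, and an nlp object that already holds the trained textcat; the bucket and key names are just placeholders.

import boto3
import spacy

s3 = boto3.client('s3')
bucket, key = 'my-models-bucket', 'textcat-checkpoint.bin'  # placeholder names

# serialize only the textcat component, without the vocab, and upload it
textcat_bytes = nlp.get_pipe('textcat').to_bytes(exclude=['vocab'])
s3.put_object(Bucket=bucket, Key=key, Body=textcat_bytes)

# later: rebuild the pipeline from the base vectors model, which supplies the vocab
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
textcat = nlp.create_pipe('textcat')

# download the checkpoint and restore the weights and cfg into the new pipe
data = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
textcat.from_bytes(data, exclude=['vocab'])
nlp.add_pipe(textcat)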