Save trained model and add to a pretrained model

Hi All,

I'm trying to figure out how best to save space. We want to do periodic batch training and persist the models to S3. I notice that when I run textcat.batch-train, it creates a new textcat spaCy model and adds it to the textcat directory in the model, next to the vocabulary directory that comes with the en_vectors_web_lg model. Since the vocabulary isn't being changed, it would be nice to omit it from the persisted model and then add it back in when running textcat.teach.

Is there a good way to mix and match the model sub-components like this?

Thanks,

Hey,

Yeah, it should be fairly easy to do what you want. You should be able to do either:

nlp.get_pipe('textcat').to_bytes(vocab=False) or nlp.get_pipe('textcat').to_disk(path, vocab=False), which should save out the textcat model without the vocab. This should let you checkpoint the model and then load it back later. The vectors data, which is the part that takes up most of the space, is immutable, so you don't need to save it out with every checkpoint.
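If you want to confirm how much space a vocab-free checkpoint actually saves, one option is to compare directory sizes before and after. This is only a minimal standard-library sketch; `dir_size_bytes` is a hypothetical helper name, not part of spaCy or Prodigy.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip anything that is not a regular file (e.g. broken symlinks).
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total
```

Calling it on the full model directory and on the directory written by the textcat pipe's `to_disk` should show roughly how much of the total is taken up by the vectors.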

Thanks @honnibal,

This works well. The part I'm having trouble with is loading the pipe back into the model.

I tried:

nlp.from_disk(<path to model>)

But I'm seeing an error:

{ValueError}Can't read file: /var/folders/7s/chb7w54n6g79dr448sp625j40000gp/T/tmp0ax88qjj/meta.json

Any help would be greatly appreciated.

PS we'd love to be part of the prodigy-scale beta so I don't have to muck around with this stuff.
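A note on the error above: it is consistent with nlp.from_disk expecting a full pipeline directory with a meta.json at the top level, whereas the directory written by a single pipe's to_disk only holds that component's files. A quick way to tell the two apart is to check for meta.json. This is a minimal standard-library sketch, and `has_pipeline_meta` is a hypothetical helper name.

```python
from pathlib import Path


def has_pipeline_meta(model_dir):
    """Return True if model_dir contains a meta.json file at its top level."""
    return (Path(model_dir) / "meta.json").is_file()
```

If this returns False for the path you passed to nlp.from_disk, the directory was probably written by a single component, and that component's own from_disk is the more likely match.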

So I think I have a working solution:

import spacy

# Save only the textcat component to disk, leaving out the vocab
nlp.get_pipe('textcat').to_disk(model_path, exclude=['vocab'])

# Later: load the base model, which supplies the vocab and vectors
nlp = spacy.load(input_model)

# Add the sentencizer that TextClassifier adds
sentencizer = nlp.create_pipe('sentencizer')
nlp.add_pipe(sentencizer)

# Create a new textcat pipe that uses the base model's vocab
textcat = nlp.create_pipe('textcat')
# Load the model and cfg from the dumped component
textcat.from_disk(model_path, exclude=['vocab'])
# Add the pipe to the language model
nlp.add_pipe(textcat)

Does this approach look correct?

Yes, that looks good :+1:
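For the S3 part of the original question, one option is to pack the saved textcat directory into a single archive, so each checkpoint is one object to upload. This is a minimal standard-library sketch. The helper names `pack_checkpoint` and `unpack_checkpoint` are hypothetical, and the upload itself is left out because it depends on whichever S3 client you use.

```python
import shutil


def pack_checkpoint(component_dir, archive_base):
    """Pack component_dir into '<archive_base>.tar.gz' and return the archive path.

    archive_base should point outside component_dir so the archive
    does not end up including itself.
    """
    return shutil.make_archive(archive_base, "gztar", root_dir=component_dir)


def unpack_checkpoint(archive_path, target_dir):
    """Extract a checkpoint archive into target_dir and return target_dir."""
    shutil.unpack_archive(archive_path, target_dir)
    return target_dir
```

After downloading and unpacking, the target directory can be passed to textcat.from_disk as in the solution above.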