Save trained model and add to a pretrained model

spacy
usage

(Timothy J Laurent) #1

Hi All,

I’m trying to figure out how best to save on space. We want to do periodic batch training and persist the models to S3. I notice that when I run textcat.batch_train, it creates a new textcat spaCy model and writes it to the textcat directory inside the model, next to the vocab directory that comes with the en_vectors_web_lg model. Since the vocabulary isn’t being changed, it would be nice to be able to omit it from the persisted model and then add it back in when running textcat.teach.

Is there a good way to mix and match the model sub-components like this?

Thanks,


(Matthew Honnibal) #2

Hey,

Yeah, it should be fairly easy to do what you want. You should be able to do either:

nlp.get_pipe('textcat').to_bytes(vocab=False) or nlp.get_pipe('textcat').to_disk(path, vocab=False), where path is the output directory. Either call should save out the textcat model without the vocab. This should let you checkpoint the model and then load it back later. The vectors data, which is the part that takes a lot of space, is immutable, so you don’t need to save it out at every checkpoint.
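To make the idea concrete, here is a minimal sketch of the checkpointing pattern using only the Python standard library. It is not spaCy's implementation: the `Vocab` and `TextCat` classes below are hypothetical stand-ins, and the `exclude` argument just mirrors the `exclude=["vocab"]` form that later spaCy releases use in place of `vocab=False`. The point is only that the small, trainable part gets serialized on its own and is reattached to the large, unchanged data at load time.

```python
import pickle


class Vocab:
    """Hypothetical stand-in for the large, read-only vectors data."""

    def __init__(self, vectors):
        self.vectors = vectors  # never modified by training


class TextCat:
    """Hypothetical stand-in for a trainable component that shares a vocab."""

    def __init__(self, vocab, weights=None):
        self.vocab = vocab
        self.weights = dict(weights or {})

    def to_bytes(self, exclude=()):
        # Always save the trainable weights; save the vocab only if not excluded.
        state = {"weights": self.weights}
        if "vocab" not in exclude:
            state["vocab"] = self.vocab.vectors
        return pickle.dumps(state)

    def from_bytes(self, data):
        # Restore the weights; keep the existing vocab unless one was saved.
        state = pickle.loads(data)
        self.weights = state["weights"]
        if "vocab" in state:
            self.vocab = Vocab(state["vocab"])
        return self


# The big, immutable part: loaded once and reused for every checkpoint.
shared_vocab = Vocab({f"word{i}": [0.0] * 50 for i in range(1000)})

# The small, trainable part: this is what changes between training runs.
trained = TextCat(shared_vocab, weights={"POSITIVE": 0.8, "NEGATIVE": 0.2})

full_checkpoint = trained.to_bytes()
small_checkpoint = trained.to_bytes(exclude=("vocab",))

# Later: build a fresh component on the same vocab and load the weights back.
restored = TextCat(shared_vocab).from_bytes(small_checkpoint)
```

In this toy setup the checkpoint without the vocab is far smaller than the full one, and the restored component points at the same `shared_vocab` object that was already in memory. With spaCy itself, the equivalent would presumably be to load the base vectors model once and call `from_bytes` or `from_disk` on its textcat component.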