I'm trying to figure out how best to save space. We want to do periodic batch training and persist the models to S3. I've noticed that when I run textcat.batch_train, it creates a new textcat spaCy model and adds it to a textcat directory inside the saved model, next to the vocabulary directory that comes with the en_vectors_web_lg model. Since the vocabulary isn't being changed, it would be nice to omit it from the persisted model and add it back in when doing textcat.teach.
Is there a good way to mix and match model sub-components like this?
Yeah, it should be fairly easy to do what you want. You should be able to use either nlp.get_pipe('textcat').to_bytes(exclude=['vocab']) or nlp.get_pipe('textcat').to_disk(model_path, exclude=['vocab']), which saves out the textcat model without the vocab. This lets you checkpoint the model and then load it back later. The vectors data, which is the part that takes up a lot of space, is immutable, so you don't need to save it out at every checkpoint.
import spacy

# save the trained textcat component to disk, without the vocab
nlp.get_pipe('textcat').to_disk(model_path, exclude=['vocab'])

# later: create a fresh pipeline that already has the vocab (and vectors)
nlp = spacy.load(input_model)

# add the sentencizer that TextClassifier adds
sentencizer = nlp.create_pipe('sentencizer')
nlp.add_pipe(sentencizer)

# create a new textcat pipe, which shares the pipeline's vocab
textcat = nlp.create_pipe('textcat')

# load the model weights and cfg from the dumped component
textcat.from_disk(model_path, exclude=['vocab'])

# add the restored pipe to the pipeline
nlp.add_pipe(textcat)
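If you'd rather push the checkpoint straight to S3 instead of going through the disk, the to_bytes route works the same way. Here's a rough sketch, assuming boto3 for the transfer, the same exclude-list arguments as above, and an nlp object that already holds the trained textcat; the bucket and key names are just placeholders.

import boto3
import spacy

s3 = boto3.client('s3')
bucket, key = 'my-models-bucket', 'textcat-checkpoint.bin'  # placeholder names

# serialize only the textcat component, without the vocab, and upload it
textcat_bytes = nlp.get_pipe('textcat').to_bytes(exclude=['vocab'])
s3.put_object(Bucket=bucket, Key=key, Body=textcat_bytes)

# later: rebuild the pipeline from the base vectors model, which supplies the vocab
nlp = spacy.load('en_vectors_web_lg')
nlp.add_pipe(nlp.create_pipe('sentencizer'))
textcat = nlp.create_pipe('textcat')

# download the checkpoint and restore the weights and cfg into the new pipe
data = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
textcat.from_bytes(data, exclude=['vocab'])
nlp.add_pipe(textcat)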