I am shipping multiple models for different datasets. I train most of the models from the existing en_core_web_lg, which is around 800 MB. I see that the vectors alone take up about 600 MB of that. So the rest of the model (the CNN weights plus other bookkeeping information) should take up about 200 MB. Is this right?
I would like to save the model separately from the vectors. Since I am always basing my models on the same vectors, I want to ship one copy of the vectors (which is already available on the site) and ship different models of around 200 MB each. That way, at runtime I can tell spaCy to use the vectors from the single shared vectors file and load whichever model is relevant for that data. Is there a way to do that in Prodigy and spaCy?
This is currently an awkward thing with spaCy that I’m very interested in fixing. I’d like to introduce a concept of “frozen” vectors that can then be held globally — imo global state is fine so long as it’s global immutable state. Then we could just have the model refer to the name of the vectors that should be used when we serialize.
At the moment the best option is to exclude the vocab from the serialization, like this:
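A minimal sketch of what that could look like, assuming spaCy v2.1 or later, where `to_bytes` and `from_bytes` accept an `exclude` list (older versions used a different argument). The helper names `dump_without_vocab` and `restore_with_shared_vocab` are made up for illustration, and an empty `Vocab` stands in for the one you would normally take from `spacy.load("en_core_web_lg").vocab`:

```python
import spacy
from spacy.vocab import Vocab

# One vocab (and therefore one copy of the vectors) shared by every pipeline.
# In practice this would be spacy.load("en_core_web_lg").vocab; an empty Vocab
# keeps the sketch runnable without downloading the model (assumption).
shared_vocab = Vocab()


def dump_without_vocab(nlp):
    """Serialize a pipeline but leave out its vocab and vectors."""
    return nlp.to_bytes(exclude=["vocab"])


def restore_with_shared_vocab(data, lang="en"):
    """Rebuild a pipeline around the shared vocab from bytes saved without one."""
    nlp = spacy.blank(lang, vocab=shared_vocab)
    return nlp.from_bytes(data, exclude=["vocab"])


# A blank pipeline stands in for a trained model here.
trained = spacy.blank("en", vocab=shared_vocab)
data = dump_without_vocab(trained)

# Both restored pipelines point at the same vocab object in memory.
model_a = restore_with_shared_vocab(data)
model_b = restore_with_shared_vocab(data)
```

For a real trained pipeline, the components (for example the NER) would most likely need to be added to the blank pipeline before calling `from_bytes`, since only components that already exist in the pipeline get their weights loaded.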
I have one concern though. When the read-only in-memory copy is implemented for vectors, it will only make sense for models that are all trained on the same vectors. If I have a mix, such as en_core_web_lg and my custom-trained model (customized over en_core_web_lg) with new terms, the read-only copy will not be useful for my custom-trained model. Then I might be forced to keep the custom-trained model's vectors separately. However, I see that the new terms in the custom-trained model might be very few compared to what is already there. Is there any way to optimize for that case when the read-only copy is implemented? It may be cumbersome to code such an optimization, though, where a core set of vectors is shared and new vectors are looked up in their own copy (which does not contain the core set of vectors all over again).
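To make the idea concrete, here is a purely conceptual sketch of such a layered lookup in plain Python. It is not how spaCy stores vectors and it does not use any spaCy API; `CORE_VECTORS`, `layered_vectors` and the toy three-dimensional vectors are all made up for illustration. Each model keeps only its few new vectors and falls back to one shared, read-only core table:

```python
from collections import ChainMap
from types import MappingProxyType

# Hypothetical core table, standing in for the shared en_core_web_lg vectors.
# MappingProxyType gives a read-only view, so no model can modify it.
CORE_VECTORS = MappingProxyType({
    "apple": (0.1, 0.2, 0.3),
    "banana": (0.4, 0.5, 0.6),
})


def layered_vectors(extra):
    """Return a lookup that checks model-specific vectors first, then the core."""
    return ChainMap(dict(extra), CORE_VECTORS)


# model_a adds one new term; model_b adds none. Neither copies the core table.
model_a = layered_vectors({"prodigy": (0.7, 0.8, 0.9)})
model_b = layered_vectors({})

print(model_a["prodigy"])  # found in model_a's own small layer
print(model_a["apple"])    # falls through to the shared core table
```

The cost is one extra dictionary check per lookup, while memory grows only with the number of new terms per model rather than with the size of the core table.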
I am answering my own question after some more analysis of the vectors. When the new model was trained over en_core_web_lg, all the new terms were given a unique random vector (the unknown vector). When an existing model encounters a new word, it assigns the same unknown vector to it. So regardless of the new terms introduced during model training (retraining), the end effect will be the same, and the concern mentioned above does not apply in such cases.
And big thanks for the tip @honnibal. Vocab sharing worked the way you suggested. Now my process size is only about 2GB even after loading multiple models with shared vocab.