en_core_web_lg Sentence Tokenization with Minimal File Size

Hi,
I have developed a custom recipe similar to textcat.manual inside a Docker image. The main customization is that it creates tasks from documents stored in a MongoDB database. The problem is that my Prodigy image is quite large, and I have been asked to reduce its size. The part of the image that takes up the most space is the en_core_web_lg spaCy model, which I am using only for sentence tokenization.
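For context, the task stream is essentially a generator along these lines (the connection URI, database, collection, and field names below are placeholders, not my actual setup):

```python
from pymongo import MongoClient

def stream_from_mongo(uri="mongodb://localhost:27017", db="docs", coll="articles"):
    # Pull documents from a MongoDB collection and yield Prodigy-style
    # tasks. Only the "text" field is projected to keep the query light.
    client = MongoClient(uri)
    for doc in client[db][coll].find({}, {"text": 1}):
        yield {"text": doc["text"]}
```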

My first idea for reducing the image size was to use the en_core_web_sm model for tokenization, but I found that the en_core_web_sm and en_core_web_md models produce different sentence boundaries than en_core_web_lg does. I fear that using the small model for sentence tokenization during annotation and the large model during inference could negatively impact accuracy. My first question: Is this a valid concern?

Next, I tried disabling the 'tagger' and 'ner' components and then deleting those portions of the model to save disk space. This reduced the model's size, but not by much. My second question: Do you have any recommendations for how I can achieve sentence tokenization equivalent to en_core_web_lg's with minimal disk space usage?
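Concretely, what I tried looked roughly like this (the output path is just a placeholder):

```python
import spacy

# Load the model without the tagger and NER components, then save the
# slimmed-down pipeline to disk. The size reduction was small, since
# most of the package is the word vectors rather than these components.
nlp = spacy.load("en_core_web_lg", disable=["tagger", "ner"])
nlp.to_disk("/tmp/en_core_web_lg_parser_only")
```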

My third question: Would you go about this differently? Any recommendations are appreciated.

Hi! The main factor behind the model's size is that the en_core_web_lg model includes word vectors. Those vectors are used as features during training, which makes the model more accurate – but it also means you can't simply remove them, because you'd end up with useless predictions.

In general, your concern is valid and worth keeping in mind, because differences like this can matter. By default, spaCy models use the dependency parser for sentence segmentation, so a different parser that produces slightly different parses can also produce different sentence boundaries.
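If you want to see how much this matters for your data, a quick check along these lines should work, assuming both model packages are installed:

```python
import spacy

# Run the same text through two models and compare the sentence
# boundaries that their dependency parsers produce.
text = "Dr. Smith went to Washington. He arrived at 5 p.m. It was late."

for name in ("en_core_web_sm", "en_core_web_lg"):
    nlp = spacy.load(name, disable=["tagger", "ner"])  # parser still sets doc.sents
    print(name, [sent.text for sent in nlp(text).sents])
```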

That said, the question is how relevant this is for training a text classifier – that's probably something you have to experiment with. For NER, the risk is more obvious: the entity recognizer will not predict entity spans across sentence boundaries (which is a good thing, since such spans are typically not correct). So in that case, differences in sentence boundaries could lead to a decrease in accuracy, especially around edge cases.

Some other ideas to try:

  • How does the en_core_web_md model perform? That model also includes word vectors, but fewer than the lg model, so it's also smaller (48 MB vs. 746 MB).
  • How does a rule-based strategy (e.g. spaCy's sentencizer, possibly with some custom rules) perform in comparison? This would be the most consistent, portable and lightweight option – see the sketch below.
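
For the sentencizer option, a minimal sketch would be a blank English pipeline with just that one component, so no trained model package is needed at all:

```python
from spacy.lang.en import English

# A blank English pipeline with only the rule-based sentencizer,
# which splits sentences on punctuation characters.
nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))  # spaCy v2; in v3: nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another one.")
print([sent.text for sent in doc.sents])
```

Because the sentencizer is purely rule-based, it behaves identically at annotation time and at inference time, which sidesteps the consistency concern entirely.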