I have developed a custom recipe, similar to textcat.manual, inside a Docker image. The main customization is that it builds tasks from documents stored in a MongoDB collection. The problem is that my Prodigy image is quite large, and I have been asked to reduce its size. The part of the image that takes up the most space is the en_core_web_lg spaCy model, which I use only for sentence segmentation.
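For context, here's a stripped-down sketch of what the recipe does (the connection string, database, collection, and field names are placeholders):

```python
import prodigy
import spacy
from pymongo import MongoClient

@prodigy.recipe("textcat.mongo-manual")
def textcat_mongo_manual(dataset: str, label: str):
    """Stream sentence-level textcat tasks built from MongoDB documents."""
    # en_core_web_lg is loaded only so doc.sents can use its parser
    nlp = spacy.load("en_core_web_lg")

    def get_stream():
        # Placeholder connection string, database, collection, and field names
        collection = MongoClient("mongodb://localhost:27017")["mydb"]["documents"]
        for record in collection.find():
            doc = nlp(record["text"])
            for sent in doc.sents:
                yield {
                    "text": sent.text,
                    "label": label,
                    "meta": {"doc_id": str(record["_id"])},
                }

    return {
        "dataset": dataset,
        "stream": get_stream(),
        "view_id": "classification",
    }
```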
My first idea for reducing the image size was to use the en_core_web_sm model for sentence segmentation instead, but I found that en_core_web_sm and en_core_web_md produce different sentence boundaries than en_core_web_lg. I worry that segmenting with the small model during annotation but with the large model during inference could hurt accuracy. My first question: Is this a valid concern?
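For reference, this is the kind of side-by-side check that surfaced the differences (any sample text from your own corpus works):

```python
import spacy

# Illustrative sample; substitute text from your own corpus.
text = (
    "Dr. Smith went to Washington D.C. to meet the team. "
    "He arrived at 3 p.m. and the meeting ran until midnight."
)

for model_name in ("en_core_web_sm", "en_core_web_md", "en_core_web_lg"):
    # Sentence boundaries come from the dependency parser, so keep it enabled.
    nlp = spacy.load(model_name, disable=["tagger", "ner"])
    doc = nlp(text)
    print(model_name)
    for sent in doc.sents:
        print("  -", sent.text)
```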
Next, I tried disabling the tagger and NER components and then deleting those portions of the model to save disk space, as in the sketch below. This reduced the model's size, but not by much. My second question: Do you have any recommendations for achieving sentence segmentation equivalent to en_core_web_lg's with minimal disk usage?
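This is roughly what the disable-and-save step looked like (the output path is a placeholder). My understanding is that most of the lg package's footprint is its word vectors, which would explain why dropping the tagger and NER barely helped:

```python
import spacy

# Load without the tagger and NER, then write the slimmed pipeline back out.
# (In spaCy v3, exclude= drops components entirely; v2's disable= behaves similarly.)
nlp = spacy.load("en_core_web_lg", exclude=["tagger", "ner"])
nlp.to_disk("/models/en_core_web_lg_slim")  # placeholder output path
```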
My third question: Would you go about this differently? Any recommendations are appreciated.