In our previous iterations, we were always running "prodigy train" without a base model for text classification (multi-label), and the score topped out around 0.82. When I used the previously trained model-last as the --base-model for our latest iteration, we got to 0.90+. Is this a good idea, or is this overfitting? We are using spacy.TextCatEnsemble.v2 as the architecture.
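For reference, this is roughly what I've been running (dataset and output paths here are placeholders):

    # earlier iterations: training from scratch, no base model
    prodigy train ./output --textcat-multilabel my_textcat_dataset

    # latest iteration: continuing from the previously trained artifact
    prodigy train ./output_v2 --textcat-multilabel my_textcat_dataset --base-model ./output/model-last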
Great question! In general, we recommend against this:
Yes, there's a real risk of overfitting here - it's typically better to train a model from scratch, using the same full corpus (instead of updating the same artifact over and over again, which often makes it much harder to avoid overfitting and forgetting effects). The jump from 0.82 to 0.90+ may also be partly inflated: if the evaluation split isn't held out consistently between runs, the base model may already have seen some of the examples it's now being evaluated on.
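Concretely, instead of pointing --base-model at the previous model-last, you'd re-run something along these lines on the full corpus each time (dataset and output names are placeholders; the eval: prefix keeps a fixed held-out set so scores stay comparable across runs):

    # retrain from scratch on the full corpus, with a dedicated evaluation set
    prodigy train ./output_v3 --textcat-multilabel my_textcat_dataset,eval:my_textcat_eval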
The one thing you may want to do (if you're not already) is to use a base model that comes with pretrained vectors (e.g., en_core_web_md or en_core_web_lg). There's a bit about it in the docs, and a sketch of the commands after the quote below:
Using pretrained word embeddings to initialize your model is easy and can make a big difference. If you’re using spaCy, try using the en_core_web_lg model as the base model. If you’re working with domain-specific texts, you can train your own vectors and create a base model with them.
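Assuming the relevant packages are installed (my_textcat_dataset and the file paths are placeholders), the commands would look roughly like this, either reusing the large English vectors or building a base model from your own domain-specific vectors:

    # download and use the large English model (with its vectors) as the base model
    python -m spacy download en_core_web_lg
    prodigy train ./output_lg --textcat-multilabel my_textcat_dataset --base-model en_core_web_lg

    # or: create a base model from your own domain-specific vectors first
    python -m spacy init vectors en ./my_domain_vectors.txt ./my_vectors_base
    prodigy train ./output_custom --textcat-multilabel my_textcat_dataset --base-model ./my_vectors_base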
Thank you for informing me that using the previously trained model as the base model is not a good idea.
I tried using en_core_web_md and en_core_web_lg as the base model, but neither seems to improve the textcat_multilabel scores compared to training without a base model.