Hi, this is a very basic question from new users: your TextCat tutorial example in the GitHub documentation starts from a blank model for the batch-train operation (Text Classification · Prodigy). My colleagues and I were wondering why you don't recommend starting from, e.g., a basic English model like en_core_web_sm? (I tried it both ways and couldn't see a reliable difference by eyeballing the results, but perhaps you can explain why it's not helpful?)
We did it that way in the tutorial to help people generalize to cases where they’re using a language that we don’t have a pre-trained model for. For instance, if you’re classifying text in Russian, the textcat model will probably work fine – but we don’t have a pre-trained model in spaCy for Russian yet.
The en_core_web_sm model doesn’t have pre-trained vectors, so yes, the textcat model should perform about the same as when you start from a blank model. However, starting from a model with pre-trained vectors, like en_core_web_md, should make the textcat somewhat better.
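
If you want to check which case a given model falls into before training, here is a minimal sketch. It assumes spaCy is installed and looks at the shape of the vocab's vectors table (`nlp.vocab.vectors.shape`); the helper names `has_pretrained_vectors` and `check_model` are made up for this example and are not part of spaCy or Prodigy.

```python
def has_pretrained_vectors(nlp):
    """Return True if the pipeline's vocab has a non-empty vectors table.

    Hypothetical helper: it only inspects nlp.vocab.vectors.shape,
    which is a (rows, dimensions) tuple.
    """
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows > 0 and n_dims > 0


def check_model(name):
    """Load an installed spaCy model package by name and report on its vectors.

    Hypothetical helper. spaCy is imported lazily so the function above
    can be used without it. Raises OSError if the package is not installed.
    """
    import spacy

    nlp = spacy.load(name)
    return has_pretrained_vectors(nlp)
```

Assuming the model packages are installed, you could call `check_model("en_core_web_sm")` and `check_model("en_core_web_md")` and compare the results; a pipeline created with `spacy.blank("en")` can be passed straight to `has_pretrained_vectors`.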