Batch-train textcat example: why start w/ a blank model, not (e.g.) en_core_web_sm?

Hi, this is a very basic question from new users: your textcat tutorial example in the GitHub documentation starts from a blank model for the batch-train operation (https://prodi.gy/docs/workflow-text-classification). My colleagues and I were wondering why you don’t recommend starting from, e.g., a basic English model like en_core_web_sm? (I tried it both ways and didn’t see a reliable difference by eyeballing the results, but perhaps you can explain why it’s not helpful?)

E.g.:

prodigy textcat.batch-train sexy_pics_request en_core_web_sm --output sexy_pics_request.model

Thanks, Lynn

We did it that way in the tutorial to help people generalize to cases where they’re using a language that we don’t have a pre-trained model for. For instance, if you’re classifying text in Russian, the textcat model will probably work fine – but we don’t have a pre-trained model in spaCy for Russian yet.

The en_core_web_sm model doesn’t have pre-trained vectors, so yes, the textcat model should perform about the same as one trained from a blank model. However, models with pre-trained vectors, like en_core_web_md, should make the textcat model somewhat better.
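If you want to check for yourself whether a given model ships with word vectors, something like the sketch below should work. It is not part of the tutorial: the helper names `has_pretrained_vectors` and `report_installed_models` are made up for illustration, and it assumes spaCy is installed and that looking at the shape of `nlp.vocab.vectors` is a good enough test for your spaCy version.

```python
def has_pretrained_vectors(nlp):
    """Return True if the pipeline's vocab has a non-empty word-vector table.

    Hypothetical helper: it only inspects nlp.vocab.vectors.shape,
    which is (number of rows, vector width).
    """
    n_rows, width = nlp.vocab.vectors.shape
    return n_rows > 0 and width > 0


def report_installed_models(names=("en_core_web_sm", "en_core_web_md")):
    """Print whether a blank English model and each named model have vectors."""
    import spacy  # imported here so the helper above has no hard dependency

    pipelines = {"blank en": spacy.blank("en")}
    for name in names:
        try:
            pipelines[name] = spacy.load(name)
        except OSError:
            # spacy.load raises OSError when the package is not installed
            print(f"{name}: not installed, skipping")
    for name, nlp in pipelines.items():
        print(f"{name}: vectors table shape {nlp.vocab.vectors.shape}, "
              f"has vectors: {has_pretrained_vectors(nlp)}")
```

Calling `report_installed_models()` prints one line per model that could be loaded. A model that reports no vectors gives the text classifier nothing extra to start from, which matches the explanation above.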

Thank you, this is very useful. I wondered about the lack of vectors!
