Batch.train textcat example- why start w/ a blank model, not (e.g.) en_core_web_sm?

arnicas · September 3, 2018, 3:51pm

Hi, this is a very basic question from new users: Your TextCat tutorial example in the GitHub documentation starts from a blank model for the batch-train operation (Text Classification · Prodigy · An annotation tool for AI, Machine Learning & NLP). My colleagues and I were wondering why you don't recommend starting from, e.g., a basic English model like en_core_web_sm? (I tried it both ways and didn't see a reliable difference from eyeballing the results, but perhaps you can explain why it's not helpful?)

E.g.:

prodigy textcat.batch-train sexy_pics_request en_core_web_sm --output sexy_pics_request.model

Thanks, Lynn

honnibal · September 5, 2018, 7:29pm

We did it that way in the tutorial to help people generalize to cases where they’re using a language that we don’t have a pre-trained model for. For instance, if you’re classifying text in Russian, the textcat model will probably work fine – but we don’t have a pre-trained model in spaCy for Russian yet.

The en_core_web_sm model doesn’t have pre-trained vectors, so yes, the textcat model should perform the same as from a blank model. However, the models with pre-trained vectors like en_core_web_md should make the textcat somewhat better.
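If you want to check whether a given model ships with pre-trained vectors before using it as the base for textcat, something like the following should work. This is only a minimal sketch: the helper names `has_pretrained_vectors` and `report` are made up for illustration, and it assumes the vocab's vectors table reports its size via `nlp.vocab.vectors.shape` as (rows, dimensions), with an empty table meaning no pre-trained vectors.

```python
def has_pretrained_vectors(nlp):
    """Return True if the pipeline's vocab carries a non-empty vectors table.

    Assumes `nlp.vocab.vectors.shape` is a (rows, dimensions) tuple.
    """
    n_rows, n_dims = nlp.vocab.vectors.shape
    return n_rows > 0 and n_dims > 0


def report(model_name):
    """Load an installed spaCy model by name and summarise its vectors.

    spaCy is imported lazily here, so the helper above can be used on an
    already-loaded pipeline without this function being called.
    """
    import spacy

    nlp = spacy.load(model_name)
    return model_name, nlp.vocab.vectors.shape, has_pretrained_vectors(nlp)
```

Assuming both packages are installed, calling `report("en_core_web_sm")` and `report("en_core_web_md")` lets you compare the two directly. If the first comes back without vectors, that would be consistent with seeing no difference from a blank model.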

arnicas · September 6, 2018, 8:31am

Thank you, this is very useful. I wondered about the lack of vectors!
