How does the spaCy language model classify before any human annotation?

I'm using the spaCy lg language model as the active learning model for my text classification annotation. I want to know how it works.

The language model wasn't trained on my classification task. How does the model do the classification? Does it try to classify using the label I am providing? If so, does that imply I should choose a meaningful label for my classification problem? Or does the model just pick instances at random?


When you call nlp.begin_training, the model weights are initialized randomly. So before you update the model with your examples, it will predict something completely arbitrary, based on the random weights. The label names have no impact – but of course, whether you have one or five labels and whether they're mutually exclusive makes a difference.
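
To make that concrete, here's a minimal sketch of a toy classifier with randomly initialized weights. It is not spaCy's actual text classifier architecture: the function names, the feature vector and the labels are all made up for illustration. It only shows that the scores come from the weights, and that renaming the labels changes nothing.

```python
import math
import random


def init_weights(n_features, n_labels, seed=None):
    """Draw one random weight row per label (hypothetical toy model)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for _ in range(n_labels)]


def predict(weights, features, labels):
    """Score each label independently with a sigmoid (non-exclusive labels)."""
    scores = {}
    for label, row in zip(labels, weights):
        z = sum(w * x for w, x in zip(row, features))
        scores[label] = 1.0 / (1.0 + math.exp(-z))
    return scores


# Made-up features standing in for one text; the seed is only here so the
# sketch prints the same numbers on every run.
features = [0.2, -1.0, 0.5, 0.7]
weights = init_weights(len(features), n_labels=2, seed=0)

# Same weights, two different sets of label names.
scores_a = predict(weights, features, ["SPORTS", "POLITICS"])
scores_b = predict(weights, features, ["LABEL_1", "LABEL_2"])

print(scores_a)
print(scores_b)
```

Both dictionaries hold the same numbers under different keys. What does change the output is the number of weight rows (one per label) and how the scores are normalized, which is where mutually exclusive vs. non-exclusive labels come in.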

If you're training a model from scratch with a recipe like textcat.teach, the main difficulty is to get over the "cold start problem" and have the model make more meaningful suggestions that you can interact with. For that, you need to update it with enough positive and negative examples. That's usually where the --patterns come in handy – they pre-select examples based on trigger words and phrases so you can start off with enough positive examples to update the model in the loop.
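
As a rough sketch of what such a patterns file could look like: each line is a JSON object with a "label" and a "pattern", where the pattern is a list of token descriptions like {"lower": "football"}. The SPORTS label, the trigger phrases and the make_patterns helper below are assumptions for illustration, so swap in whatever fits your task.

```python
import json

# Hypothetical label and trigger phrases, for illustration only.
LABEL = "SPORTS"
TRIGGERS = ["football", "world cup", "tennis"]


def make_patterns(label, phrases):
    """Build one token-based pattern per phrase, matching on lowercase text."""
    patterns = []
    for phrase in phrases:
        token_specs = [{"lower": token} for token in phrase.lower().split()]
        patterns.append({"label": label, "pattern": token_specs})
    return patterns


patterns = make_patterns(LABEL, TRIGGERS)

# One JSON object per line (JSONL). Save this output as e.g. patterns.jsonl
# and pass that file to the recipe via --patterns.
jsonl_text = "\n".join(json.dumps(entry) for entry in patterns)
print(jsonl_text)
```

Matching on the lowercase form keeps the patterns case-insensitive, so "Football" and "football" both get pre-selected.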

Thanks. That's what we're thinking - use patterns to bootstrap the training process.

Follow-up question: to get stable performance, can we initialize the model with the same weights each time, instead of random weights?