Best way to create a model for sentiment analysis

I'm working on sentiment analysis using textcat and have a few questions.

I have a dataset with 5500 annotations labeled with POSITIVE and NEGATIVE.

I exported the dataset, and for each row with a positive label I added the same row as a negative reject, and vice versa.
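
The step described above can be sketched roughly as follows. This is only an illustration, assuming each exported row is a JSON record with `text`, `label` and `answer` fields; the helper name `add_complement_rows` and the sample sentences are hypothetical and not taken from the actual script.

```python
import json

# Assumed mapping between the two labels used in this thread.
OPPOSITE = {"POSITIVE": "NEGATIVE", "NEGATIVE": "POSITIVE"}


def add_complement_rows(examples):
    """Return the examples plus one rejected opposite-label copy per accepted row."""
    out = []
    for eg in examples:
        out.append(eg)
        if eg.get("answer") != "accept":
            continue  # only accepted rows get a complementary reject
        label = eg["label"]
        if label not in OPPOSITE:
            # Catches stray labels such as a lowercase "negative".
            raise ValueError(f"Unexpected label: {label!r}")
        # Shallow copy: every other field is carried over unchanged.
        out.append(dict(eg, label=OPPOSITE[label], answer="reject"))
    return out


# Hypothetical sample rows, for illustration only.
examples = [
    {"text": "Jättebra service!", "label": "POSITIVE", "answer": "accept"},
    {"text": "Riktigt dålig upplevelse.", "label": "NEGATIVE", "answer": "accept"},
]
augmented = add_complement_rows(examples)
for eg in augmented:
    print(json.dumps(eg, ensure_ascii=False))
```

Raising on an unknown label means a casing mistake in the script stops the run instead of silently adding extra labels to the data.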

I have created a Swedish model that I use as the base model for training.
I use it when I annotate and also when I train a new model using textcat.batch-train. I always output to a new model.

My questions:

  1. On the forum you talk about training from a fresh model. Does my Swedish base model count as a fresh model, or do I need to create a completely new, empty one (nlp.to_disk) to train from?

  2. My labels are POSITIVE and NEGATIVE, in uppercase. In my data manipulation script, where I add a negative reject row for each positive row, I added a new label with lowercase negative by mistake. The output model ended up with 4 labels, and I got a much better result on the new lowercase labels than on the annotated uppercase ones. Is there any explanation for this? Is anything stored in the “from model” during annotation or batch training?


If your model already has weights for text classification, then yeah, I would recommend starting from a new model rather than resuming training. It’s better to train from random weights each time instead of resuming from the previous training run, because it’s a bit easier to reason about, and you might avoid overfitting better. The other thing you might want to do is download some pretrained Swedish vectors. You can use these to initialise a model with spacy init-model. Pretrained vectors are likely to be pretty helpful for your problem.

The situation you describe with the four labels is very confusing! I’m not sure what could be going on there. If you keep finding the same thing — that this weird doubling of the labels improves the scores — I’d be curious to dig a little deeper.
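
As a starting point for digging, here is a rough sketch that counts how often each label and answer combination occurs in the training data and flags labels that differ only by case. It assumes rows with `label` and `answer` fields; the function name `label_report` and the sample rows are made up for illustration.

```python
from collections import Counter


def label_report(examples):
    """Count (label, answer) pairs and find labels that differ only by case."""
    counts = Counter((eg["label"], eg["answer"]) for eg in examples)
    # Group the distinct labels by their uppercase form.
    by_upper = {}
    for label, _answer in counts:
        by_upper.setdefault(label.upper(), set()).add(label)
    case_variants = {
        key: sorted(variants)
        for key, variants in by_upper.items()
        if len(variants) > 1
    }
    return counts, case_variants


# Hypothetical rows mimicking the accidental lowercase label.
rows = [
    {"text": "a", "label": "POSITIVE", "answer": "accept"},
    {"text": "a", "label": "negative", "answer": "reject"},
    {"text": "b", "label": "NEGATIVE", "answer": "accept"},
    {"text": "b", "label": "POSITIVE", "answer": "reject"},
]
counts, case_variants = label_report(rows)
for (label, answer), n in sorted(counts.items()):
    print(label, answer, n)
print("Labels differing only by case:", case_variants)
```

If the lowercase labels only ever appear with one kind of answer, that alone could make their scores look better than those of the uppercase labels, so the counts are worth checking before comparing accuracy.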