Why am I getting better results with textcat-multilabel than with textcat?

We are trying to train a model (exclusive categories) for sentiment analysis.
Our dataset would contain something like: "accept":["anxious"], "accept":["happy"], "accept":["upset"], or "accept":["neutral"]

We are getting a very low score, between 0.30 and 0.35. When we switched to "--textcat-multilabel" (non-exclusive categories), we got up to 0.70. Why is this so?

Thanks.

Hi Joe.

Could you clarify what you mean by "score"? I'm assuming you're referring to the column in the training table, but feel free to correct me if you're referring to something else.

The score that you see there is usually a weighted mix of several metrics, and it's hard to use as a comparative measure between tasks. Just as it would be hard to compare the "score" of an NER pipeline to the score of a spancat pipeline ... it's also hard to make a meaningful comparison between the scores of a textcat and a textcat-multilabel pipeline, because these are different tasks. Metrics like "accuracy" are calculated differently for exclusive categories than for non-exclusive categories.

Does this help?

More info

The spaCy Scorer API docs give a bit more information on this topic.

You can also read more about these scores and how they are configurable here:

Yes, I am referring to that score. But why does it get better with textcat-multilabel? We are using a custom recipe in Prodigy, but we assumed that "accept" is an array for both textcat and textcat-multilabel. Is that right?

It might help if you could share the tables from both runs. But it might also help to think of the score more as a side effect of the algorithm than as something that is defined from just the data.

If a learning algorithm assumes exclusive labels, then it will use an architecture/loss function that reflects that. Another learning algorithm that assumes non-exclusive labels will have a different activation layer/loss function ... and as a result it will also report a different number.
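
To make that concrete, here is a rough numpy sketch, purely illustrative and not spaCy's internal code, of how the same raw model outputs get turned into probabilities under the two assumptions:

import numpy as np

# Raw, unnormalised scores from a model for one document over
# the labels ["anxious", "happy", "upset", "neutral"].
logits = np.array([2.0, 0.5, -1.0, 0.3])

# Exclusive categories (textcat): a softmax forces the probabilities
# to sum to 1, so the labels compete and exactly one "wins".
softmax = np.exp(logits) / np.exp(logits).sum()

# Non-exclusive categories (textcat_multilabel): a sigmoid scores each
# label independently, so any number of labels can be "on" at once.
sigmoid = 1 / (1 + np.exp(-logits))

print(softmax, softmax.sum())  # sums to 1.0
print(sigmoid, sigmoid.sum())  # does not sum to 1.0

The loss and the reported metrics both follow from that choice. Does this make sense?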

Hi Vincent.
Here are the screenshots for both runs.

We are just wondering why textcat-multilabel performs better, since they are both text-classification tasks. :slightly_frowning_face:

Here is our textcat.cfg:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat]
factory = "textcat"
scorer = {"@scorers":"spacy.textcat_scorer.v2"}
threshold = 0.0

[components.textcat.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

[corpora]
@readers = "prodigy.MergedCorpus.v1"
eval_split = 0.2
sample_size = 1.0

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Here is our textcat-multilabel.cfg:
[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = null
seed = 0

[nlp]
lang = "en"
pipeline = ["textcat_multilabel"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.textcat_multilabel]
factory = "textcat_multilabel"
scorer = {"@scorers":"spacy.textcat_multilabel_scorer.v1"}
threshold = 0.5

[components.textcat_multilabel.model]
@architectures = "spacy.TextCatEnsemble.v2"
nO = null

[components.textcat_multilabel.model.linear_model]
@architectures = "spacy.TextCatBOW.v2"
exclusive_classes = false
ngram_size = 1
no_output_layer = false

[components.textcat_multilabel.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.textcat_multilabel.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
rows = [2000, 2000, 1000, 1000, 1000, 1000]
attrs = ["ORTH", "LOWER", "PREFIX", "SUFFIX", "SHAPE", "ID"]
include_static_vectors = false

[components.textcat_multilabel.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = ${components.textcat_multilabel.model.tok2vec.embed.width}
window_size = 1
maxout_pieces = 3
depth = 2

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null

[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
cats_score = 1.0
cats_score_desc = null
cats_micro_p = null
cats_micro_r = null
cats_micro_f = null
cats_macro_p = null
cats_macro_r = null
cats_macro_f = null
cats_macro_auc = null
cats_f_per_type = null
cats_macro_auc_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]

Have you seen this score section in the spaCy docs? This bit is specific to the textcat tasks.

To quote the page:

The reported {attr}_score depends on the classification properties:

  • binary exclusive with positive label: {attr}_score is set to the F-score of the positive label
  • 3+ exclusive classes, macro-averaged F-score: {attr}_score = {attr}_macro_f
  • multilabel, macro-averaged AUC: {attr}_score = {attr}_macro_auc

So in your case, one pipeline is training with exclusive labels, which means it reports the macro-averaged F-score, while the non-exclusive setting (multilabel) reports the macro-averaged AUC. Because the two tasks are different, spaCy picks different metrics to report during training.
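
If you want to convince yourself that these two numbers live on different scales, you can compute both metrics on the same set of predictions. Here's a rough scikit-learn sketch with made-up numbers; it is not how spaCy computes its score internally, it just shows that a macro F-score and a macro AUC over the very same predictions can be quite far apart:

import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Made-up gold labels (one-hot) and predicted probabilities for 4 classes.
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])
y_prob = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.3, 0.5, 0.1],
                   [0.1, 0.45, 0.35, 0.1],
                   [0.1, 0.2, 0.2, 0.5]])

# Exclusive-style reporting: macro-averaged F-score over hard predictions.
y_pred = (y_prob == y_prob.max(axis=1, keepdims=True)).astype(int)
print(f1_score(y_true, y_pred, average="macro"))       # ~0.5 for these numbers

# Multilabel-style reporting: macro-averaged ROC AUC over the probabilities.
print(roc_auc_score(y_true, y_prob, average="macro"))  # ~0.83 for these numbers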

Does this help?

OK. So do you suggest that for actual prediction we should just use the trained textcat-multilabel model so we get better results? We are not sure how to improve the model. :frowning:

You should use the model that best fits the task. The choice of the pipeline component is less of a hyperparameter and more of a modelling decision.

For example, if you have a classification task where you are absolutely sure that the label belongs to one, and only one (!), of the classes then you want to use a model with exclusive categories. An example could be sentiment labels such as: POSITIVE, NEUTRAL, NEGATIVE. In this case you would have sentences where the model only needs to predict one of the labels, but it must always select one of them.

However, if you have a classification task where there are multiple labels but they are not exclusive, then you'll pick a different model. An example could be newspaper tags like AI, TECH, POLICY, ENVIRONMENT. An article can be about both AI and TECH, so the model needs to be able to predict that.
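
In spaCy terms, the difference shows up in the gold cats dict you train on. A small sketch with made-up texts, using labels that mirror the two examples above:

# Exclusive categories (textcat): exactly one label is 1.0 for every
# example, all the others are 0.0.
exclusive_example = {
    "text": "I can't stop worrying about tomorrow.",
    "cats": {"anxious": 1.0, "happy": 0.0, "upset": 0.0, "neutral": 0.0},
}

# Non-exclusive categories (textcat_multilabel): any number of labels
# can be 1.0 at the same time.
multilabel_example = {
    "text": "New AI rules proposed for tech companies.",
    "cats": {"AI": 1.0, "TECH": 1.0, "POLICY": 1.0, "ENVIRONMENT": 0.0},
}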

If you want to improve your existing model, the first step is usually to understand when it makes errors. Did you have a look at that? In general, if you can spot a specific kind of issue then you may try to proceed by annotating more examples similar to the ones that it got wrong.
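
As a concrete starting point, something along these lines can list the dev examples your model gets wrong. It's only a sketch: "model-best" and "dev.spacy" are placeholder paths for your trained pipeline and your dev set, so adjust them to your setup.

import spacy
from spacy.tokens import DocBin

# Placeholder paths: point these at your trained pipeline and dev data.
nlp = spacy.load("model-best")
doc_bin = DocBin().from_disk("dev.spacy")

for gold in doc_bin.get_docs(nlp.vocab):
    pred = nlp(gold.text)
    gold_label = max(gold.cats, key=gold.cats.get)
    pred_label = max(pred.cats, key=pred.cats.get)
    if gold_label != pred_label:
        print(f"gold={gold_label:<10} pred={pred_label:<10} {gold.text[:80]}")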

I understand. Thank you. Just a follow-up question: do you think that using transformers or even an LLM for training might improve the results/scores?

Thank you for your responses.

You can consider other types of models, because you seem to have a reasonable amount of training data. However, it's probably best to understand what kinds of errors your model makes before looking for a model improvement.

Are there two classes that are particularly hard to distinguish? Does your model overfit on one of the classes? Are you 100% sure that all of your labels are correct? Are you able to confirm that by having multiple annotators look at it? That sort of thing.
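
For the "which classes get confused" question, a quick confusion matrix over the dev set usually tells you a lot. A sketch, assuming scikit-learn is installed and using the same placeholder paths as in the earlier snippet:

import spacy
from spacy.tokens import DocBin
from sklearn.metrics import confusion_matrix

nlp = spacy.load("model-best")  # placeholder path
docs = list(DocBin().from_disk("dev.spacy").get_docs(nlp.vocab))

labels = sorted(docs[0].cats)
gold = [max(d.cats, key=d.cats.get) for d in docs]
pred = []
for d in docs:
    cats = nlp(d.text).cats
    pred.append(max(cats, key=cats.get))

# Rows are gold labels, columns are predicted labels; large off-diagonal
# counts point at pairs of classes the model mixes up.
print(labels)
print(confusion_matrix(gold, pred, labels=labels))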

Personally, I've always found that being able to answer these questions usually leads me to an improvement.

Thank you. One last follow-up: are you aware of any spaCy/Prodigy project that is able to detect emotion in audio (not the text/transcription, but the actual audio data)?

Not that I am aware of.