Scorer for Text Classification

I'm trying to evaluate my multilabel textcat with Scorer, but scorer.scores doesn't return anything.

from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, test_data):
    scorer = Scorer()
    for text, label in test_data:
        # text: "my example text"
        # label: {"cats": {"cat1": 0.0, "cat2": 1.0, "cat3": 0.0}}
        doc_gold_text = nlp.make_doc(text)
        gold = GoldParse(doc_gold_text, cats=label["cats"])
        pred_value = nlp(text)
        scorer.score(pred_value, gold)

    return scorer.scores
    # return: {}
    # actually, the other fields also return zero

The model predicts well, and I use this same function to evaluate NER in another model successfully. Am I doing something wrong in this case? Does Scorer not work for textcat?

I don't know what's wrong with the code above, but I tried a different approach and it worked, using nlp.evaluate():

def evaluate(nlp, test_data):
    eval_input = [(nlp.make_doc(text), GoldParse(nlp.make_doc(text), cats=label["cats"])) for text, label in test_data]
    scorer = nlp.evaluate(eval_input)
    return scorer.scores
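
For anyone who wants to sanity-check the numbers by hand, here is a minimal sketch of per-label precision, recall and F-score computed directly from gold and predicted category dicts. It is plain Python, not spaCy's own scoring code: the function name textcat_prf and the 0.5 threshold are assumptions made for illustration.

```python
def textcat_prf(gold_cats, pred_cats, threshold=0.5):
    """Per-label precision, recall and F-score for multilabel category scores.

    gold_cats / pred_cats: lists of {label: score} dicts, one per document.
    threshold: assumed cut-off above which a score counts as positive.
    """
    labels = sorted({label for cats in gold_cats for label in cats})
    results = {}
    for label in labels:
        tp = fp = fn = 0
        for gold, pred in zip(gold_cats, pred_cats):
            is_gold = gold.get(label, 0.0) >= threshold
            is_pred = pred.get(label, 0.0) >= threshold
            if is_gold and is_pred:
                tp += 1
            elif is_pred:
                fp += 1
            elif is_gold:
                fn += 1
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = {"p": p, "r": r, "f": f}
    return results


# Made-up example data in the same shape as label["cats"] and doc.cats
gold = [{"cat1": 0.0, "cat2": 1.0}, {"cat1": 1.0, "cat2": 0.0}]
pred = [{"cat1": 0.1, "cat2": 0.9}, {"cat1": 0.3, "cat2": 0.2}]
scores = textcat_prf(gold, pred)
```

With a trained pipeline, pred would be something like [nlp(text).cats for text, _ in test_data] and gold would be [label["cats"] for _, label in test_data].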

Hi @ines

We're trying to solve a multi-label text classification problem for which Prodigy was used for annotation. There are 10 classes, and almost 25% of the samples are positive (have one or more labels). We trained the model using the spaCy CLI command with en_vectors_web_lg as our base model.

Our goal is to aggregate (sum/avg) all scores across 100 inferences and arrive at relative aggregate scores across the 10 classes.
However, there is major variance in the scores, which is causing problems:

  1. The HIGH scores for two different positive samples of the same class are very different. For one sample it is "label A": 0.177, and for another it is "label A": 0.667. Why so much variance? Do we need to normalize?

  2. Also, the order of magnitude of the LOW scores varies widely between samples, ranging from 10e-1 to 10e-3. Again, why so much variance? Do we need to normalize?

I have attached a screenshot below.

  3. My last question is about the model architecture. Does spaCy use a sigmoid activation function for multi-label classification?
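
For reference, the aggregation described above could be sketched roughly as follows. This is plain Python and only an illustration: it assumes each inference yields a {label: score} dict shaped like doc.cats, and the helper name mean_scores is made up for this example. The two values for "label A" are the ones quoted above; the "label B" values are invented.

```python
from collections import defaultdict


def mean_scores(all_cats):
    """Average per-label scores over a list of {label: score} dicts.

    A label missing from one dict is treated as contributing 0.0 there.
    """
    totals = defaultdict(float)
    for cats in all_cats:
        for label, score in cats.items():
            totals[label] += score
    n = len(all_cats)
    return {label: total / n for label, total in totals.items()} if n else {}


# Two inferences; in practice this list would hold one dict per document
inferences = [
    {"label A": 0.177, "label B": 0.010},
    {"label A": 0.667, "label B": 0.002},
]
averaged = mean_scores(inferences)
```

Because the mean is taken per label, the aggregate inherits whatever spread the individual scores have, which is why the variance matters for our use case.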


Hi @ines,

This is a follow-up to Mayank's post about the "major variance in scoring which is causing problems".
Any suggestions on how we can fix the variance in scoring?
We were expecting softmax scores, where the multi-class scores/probabilities sum to 1, which is not the case. And we don't have any insight into the architecture of the model.

Thanks for your advice.


Hi Kapil and Mayank,

You said you have a "multi-label" textcat problem, does that mean that in fact, one sample text can be annotated with multiple positive labels? Because in that case, the output probabilities wouldn't sum up to 1 - the different labels would be seen as "parallel" classification challenges.
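
To make that difference concrete, here is a small sketch in plain Python. It is not spaCy's actual implementation: it assumes a logistic (sigmoid) activation applied to each label independently for the multi-label case, and the raw scores in logits are made up.

```python
import math


def sigmoid(x):
    """Squash one raw score into (0, 1), independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    """Turn raw scores into probabilities that compete and sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, -1.0, 0.5]  # made-up raw scores for three labels

independent = [sigmoid(x) for x in logits]  # multi-label: one decision per label
exclusive = softmax(logits)                 # mutually exclusive classes

sum_independent = sum(independent)  # not constrained to any particular value
sum_exclusive = sum(exclusive)      # 1.0 up to floating-point rounding
```

With independent per-label scores, a text can score high on several labels at once or low on all of them, so the total is not expected to be 1.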

If, however, you have a "multi-class" textcat problem where only 1 class can be positive per sample, you'd need to set exclusive_classes to True in your textcat model, and then a Softmax output layer would indeed be used (

Can you share which exact script you're running to perform the text classification, and what exactly the parameters of your challenge are?