I'm trying to evaluate my multilabel textcat model with Scorer, but scorer.scores doesn't return anything.

from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(nlp, test_data):
    scorer = Scorer()
    for text, label in test_data:
        # text: "my example text"
        # label: {"cats": {"cat1": 0.0, "cat2": 1.0, "cat3": 0.0}}
        doc_gold_text = nlp.make_doc(text)
        gold = GoldParse(doc_gold_text, cats=label["cats"])
        pred_value = nlp(text)
        scorer.score(pred_value, gold)
    return scorer.scores
    # returns: {}
    # actually, the other fields also return zero

The model predicts well, and I use this same function to evaluate NER in another model with success. Am I doing something wrong in this case? Does Scorer not work for textcat?
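One way to check the textcat predictions independently of Scorer is to compare the gold "cats" dicts with the predicted ones directly. The following is only a minimal sketch: the function name textcat_prf and the 0.5 threshold are assumptions made for illustration, not part of spaCy.

```python
from collections import defaultdict

def textcat_prf(examples, threshold=0.5):
    """Per-label precision/recall/F1 from (gold_cats, pred_cats) dict pairs.

    `threshold` is an assumed cut-off for calling a label positive.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold_cats, pred_cats in examples:
        for label, gold_value in gold_cats.items():
            c = counts[label]  # make sure every gold label shows up in the result
            is_gold = gold_value >= threshold
            is_pred = pred_cats.get(label, 0.0) >= threshold
            if is_gold and is_pred:
                c["tp"] += 1
            elif is_pred:
                c["fp"] += 1
            elif is_gold:
                c["fn"] += 1
    results = {}
    for label, c in counts.items():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        p = c["tp"] / predicted if predicted else 0.0
        r = c["tp"] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = {"p": p, "r": r, "f": f}
    return results

# Toy data standing in for (label["cats"], nlp(text).cats) pairs
examples = [
    ({"cat1": 0.0, "cat2": 1.0}, {"cat1": 0.1, "cat2": 0.9}),
    ({"cat1": 1.0, "cat2": 0.0}, {"cat1": 0.3, "cat2": 0.8}),
]
results = textcat_prf(examples)
```

With a trained pipeline, the pairs could be built as (label["cats"], nlp(text).cats) for each item in test_data, since the predicted scores are stored on doc.cats.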

We're trying to solve a multi-label text classification problem where Prodigy has been used for annotation. There are 10 classes, and almost 25% of the samples are positive (have one or more labels). We've trained the model using the spaCy CLI command with en_vectors_web_lg as our base model.

Our goal is to aggregate (sum/avg) all scores across 100 inferences and arrive at relative aggregate scores across the 10 classes.
However, there is a major variance in scoring which is causing problems:

The HIGH scores for two different positive samples of the same class are very different. For one sample it is "label A": 0.177, and for another it is "label A": 0.667. Why so much variance? Do we need to normalize?

Also, the order of magnitude of the LOW scores varies widely across samples, ranging from 10e-1 to 10e-3. Again, why so much variance? Do we need to normalize?
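For reference, the aggregation described above could be sketched as a per-label average over the score dicts of each inference. The helper name mean_scores and the "label B" values are made up for illustration; only the two "label A" scores come from the samples quoted above.

```python
def mean_scores(all_cats):
    """Average each label's score over a list of per-inference score dicts."""
    totals = {}
    for cats in all_cats:
        for label, score in cats.items():
            totals[label] = totals.get(label, 0.0) + score
    n = len(all_cats)
    # Labels missing from an inference contribute 0.0 to that label's average.
    return {label: total / n for label, total in totals.items()}

# Two inferences; the "label B" values are hypothetical
inferences = [
    {"label A": 0.177, "label B": 0.01},
    {"label A": 0.667, "label B": 0.001},
]
averages = mean_scores(inferences)
```

In practice each dict would be the doc.cats of one processed text.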

This is a follow-up to Mayank's post - "major variance in scoring which is causing problems"
Any suggestions on how we can fix the variance in scoring?
We were expecting softmax scores, where the multi-class scores/probabilities total 1, which is not the case. We also don't have insight into the architecture of the model.

You said you have a "multi-label" textcat problem, does that mean that in fact, one sample text can be annotated with multiple positive labels? Because in that case, the output probabilities wouldn't sum up to 1 - the different labels would be seen as "parallel" classification challenges.

If, however, you have a "multi-class" textcat problem where only 1 class can be positive per sample, you'd need to set exclusive_classes to True in your textcat model, and then a Softmax output layer would indeed be used (https://github.com/explosion/spaCy/blob/master/spacy/_ml.py#L702-L707).
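To illustrate the difference between the two cases, here is a small standalone sketch (plain Python, not spaCy code). It assumes the non-exclusive case scores each label with its own logistic function, and the raw scores in logits are made-up values.

```python
import math

def sigmoid_scores(logits):
    """Each label is scored independently, so the scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

def softmax_scores(logits):
    """Labels compete with each other, so the scores always sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.5, 0.7, -3.0]  # hypothetical raw scores for three labels
independent = sigmoid_scores(logits)
exclusive = softmax_scores(logits)
print(sum(independent), sum(exclusive))
```

In the independent case, a text can score low on every label or high on several at once, so score sums and magnitudes can differ widely from one sample to the next.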

Can you share which exact script you're running to perform the text classification, and what exactly the parameters of your challenge are?