textcat F1 score goes up and down, up and down

I'm training a textcat model on emails, each with at most 150 tokens. I have a separate evaluation set with 5 positive and 5 negative examples. During training the loss goes down, but the F-score bounces up and down, which is different from what I normally see (F1 going up as the loss goes down). After 30 iterations, the model's F1 score is 0.63, well below the 0.8 baseline. What could be wrong?

Here is the training output:

Baseline accuracy: 0.800

#   Loss    F-Score
1 16.84 0.500
2 12.83 0.367
3 8.44 0.667
4 6.25 0.800
5 7.50 0.667
6 4.91 0.633
7 4.32 0.900
8 4.07 0.800
9 2.79 0.800
10 2.56 0.733
11 0.98 0.767
12 0.19 0.700
13 0.02 0.700
14 0.00 0.667
15 0.10 0.633
16 0.05 0.633
17 0.01 0.633
18 0.01 0.600
19 0.03 0.600
20 0.02 0.567
21 0.00 0.533
22 0.10 0.533
23 0.03 0.500
24 0.00 0.533
25 0.00 0.533
26 0.02 0.533
27 0.06 0.533
28 0.00 0.567
29 0.00 0.633
30 0.00 0.633

============================= ✨ Results summary =============================

Label   ROC AUC
mnpi    0.900

Best ROC AUC 0.900
Baseline 0.800

How many examples are you training with in total? And does this mean your evaluation set only contains 10 examples?

If you are in fact only evaluating on 10 examples, that could explain a lot: it's really difficult to draw any conclusive results from a dataset that small, and you would expect the accuracy to jump around like this. To put this into perspective: if you're training a binary classification model and evaluating on 10 examples, a small difference that leads to a single extra mistake costs you 10% in accuracy (see the sketch below).
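To make that concrete, here is a minimal sketch (using scikit-learn's `f1_score`, which is my own addition and not part of your setup) of how a single extra mistake moves F1 on a hypothetical 10-example evaluation set with 5 positives and 5 negatives, matching the setup you describe:

```python
# Minimal sketch: F1 sensitivity on a tiny evaluation set.
# The labels below are hypothetical, chosen to mirror a
# 5-positive / 5-negative binary eval set.
from sklearn.metrics import f1_score

y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # gold labels: 5 positive, 5 negative

# Run A: the model makes one mistake (one false negative).
pred_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Run B: identical except for one additional false negative.
pred_b = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(f1_score(y_true, pred_a))  # 0.889  (precision 1.0, recall 0.8)
print(f1_score(y_true, pred_b))  # 0.750  (precision 1.0, recall 0.6)
```

One flipped prediction out of ten moves F1 by roughly 0.14 here, which is the same order of magnitude as the swings in your training log.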

Thank you very much for the reply. You are absolutely right.
After I added more data to the evaluation set, the F1 score now goes up as the loss goes down.