I'm a bit puzzled by the results I get from a classification task. I've tried two approaches where I'd expect very similar results but that's not the case.
First approach
I've used a binary annotation scheme, so I have `text`, `label` and `answer` for each example, i.e. the same label in every example but with different answers.
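For illustration (the texts and values here are invented, not my actual data), two binary-annotated tasks would look roughly like this:

```python
# Illustrative only: two binary-annotated tasks sharing the same label
# ("OUTLOOK") but with different answers, as in my dataset.
examples = [
    {"text": "We expect margins to improve next year.", "label": "OUTLOOK", "answer": "accept"},
    {"text": "Headquarters moved to Berlin in 2015.", "label": "OUTLOOK", "answer": "reject"},
]
```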
```
❯ prodigy train textcat modified blank:en
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.510

=========================== ✨ Training the model ===========================

#    Loss       F-Score
--   --------   --------
 1   122.44     0.835
 2   2.15       0.870
 3   0.42       0.899
 4   0.14       0.908
 5   0.06       0.917
 6   0.04       0.917
 7   0.03       0.920
 8   0.03       0.924
 9   0.03       0.926
10   0.03       0.930

============================= ✨ Results summary =============================

Label     ROC AUC
-------   -------
OUTLOOK   0.930

Best ROC AUC   0.930
Baseline       0.510
```
Second approach
I've tried transforming `label` into `OUTLOOK` or `NOT_OUTLOOK` based on the answer, removing `answer` from my examples, and requiring exclusive label classification in training. My results now look like this:
```
❯ prodigy train textcat modified2 blank:en --textcat-exclusive
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 45.485

=========================== ✨ Training the model ===========================

#    Loss       F-Score
--   --------   --------
 1   246.91     61.949
 2   4.09       69.164
 3   0.83       71.414
 4   0.27       74.955
 5   0.12       78.210
 6   0.07       81.008
 7   0.06       81.369
 8   0.06       81.302
 9   0.06       81.942
10   0.06       81.656

============================= ✨ Results summary =============================

Label         F-Score
-----------   -------
NOT_OUTLOOK   94.319
OUTLOOK       69.565

Best F-Score   81.942
Baseline       45.485
```
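For reference, the transformation I applied for the second approach was essentially the following (a minimal sketch; the function name is mine, but the `text`/`label`/`answer` fields follow Prodigy's task format):

```python
# Hypothetical sketch of my preprocessing for the second approach:
# map an accepted OUTLOOK example to OUTLOOK, a rejected one to
# NOT_OUTLOOK, and drop the "answer" field.

def to_exclusive(example):
    """Convert a binary-annotated task into an exclusive-label task."""
    label = "OUTLOOK" if example["answer"] == "accept" else "NOT_OUTLOOK"
    return {"text": example["text"], "label": label}

binary = [
    {"text": "Revenue is expected to grow next quarter.", "label": "OUTLOOK", "answer": "accept"},
    {"text": "The company was founded in 1998.", "label": "OUTLOOK", "answer": "reject"},
]
exclusive = [to_exclusive(eg) for eg in binary]
# exclusive now contains one OUTLOOK and one NOT_OUTLOOK example,
# with no "answer" field on either.
```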
I'm curious why I'm seeing such different results, and why one of the methods seems to report F1 as a percentage while the other doesn't (a bug?). From the looks of it, the first approach is clearly the way to go, or am I reading this wrong?