Inconsistent results

I'm a bit puzzled by the results I get from a classification task. I've tried two approaches where I'd expect very similar results, but that's not the case.

First approach

I've used a binary annotation scheme, so I have text, label and answer for each example, i.e. the same label in every example but with different answers.
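Two records from my data look roughly like this (the texts here are made up, but the structure is exact):

{"text": "Management expects revenue to grow next quarter.", "label": "OUTLOOK", "answer": "accept"}
{"text": "The company was founded in 1998.", "label": "OUTLOOK", "answer": "reject"}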

❯ prodigy train textcat modified blank:en
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.510


=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    122.44     0.835
2    2.15       0.870
3    0.42       0.899
4    0.14       0.908
5    0.06       0.917
6    0.04       0.917
7    0.03       0.920
8    0.03       0.924
9    0.03       0.926
10   0.03       0.930

============================= ✨  Results summary =============================

Label     ROC AUC
-------   -------
OUTLOOK     0.930


Best ROC AUC   0.930
Baseline       0.510

Second approach

I've tried transforming the label into OUTLOOK or NOT_OUTLOOK based on the answer, then removing the answer from my examples and requiring exclusive label classification in training.
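The conversion itself was straightforward; a minimal sketch of what I did is below. The file names are placeholders for my exported and re-imported datasets, and srsly is the JSONL helper library that ships with spaCy:

import srsly  # JSONL read/write helpers, installed alongside spaCy

# Placeholder path: the db-out export of the binary annotations
examples = srsly.read_jsonl("modified.jsonl")

converted = []
for eg in examples:
    if eg.get("answer") == "ignore":
        continue  # skip ignored examples
    # accept -> OUTLOOK, reject -> NOT_OUTLOOK
    new_label = "OUTLOOK" if eg["answer"] == "accept" else "NOT_OUTLOOK"
    # keep only the text and the derived label, i.e. drop the answer key
    converted.append({"text": eg["text"], "label": new_label})

# Placeholder path: this file was then imported as the modified2 dataset
srsly.write_jsonl("modified2.jsonl", converted)

My results now look like this: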

❯ prodigy train textcat modified2 blank:en --textcat-exclusive
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 45.485

=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    246.91     61.949
2    4.09       69.164
3    0.83       71.414
4    0.27       74.955
5    0.12       78.210
6    0.07       81.008
7    0.06       81.369
8    0.06       81.302
9    0.06       81.942
10   0.06       81.656

============================= ✨  Results summary =============================

Label         F-Score
-----------   -------
NOT_OUTLOOK    94.319
OUTLOOK        69.565


Best F-Score   81.942
Baseline       45.485

I'm curious why I'm seeing such different results, and why one of the methods seems to give me its scores as percentages and the other doesn't (bug?). From the looks of this, the first approach is definitely the way to go, or am I reading it wrong?

Which versions of spaCy and Prodigy are you using? And if you're not on the latest, could you try again with the latest spaCy (v2.2.4) and Prodigy (v1.9.9)? We recently fixed a few training-related issues, so it's possible that those will affect your results as well.

Sorry, I forgot to put that in. I just started a clean virtual environment with Python 3.8.2:

  • spacy: 2.2.4
  • prodigy: 1.9.9

(Off topic: I don't know if you're aware, but https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz is downloading at ~30 kB/s at the moment; I'm in Copenhagen. Might be a GitHub issue, of course.)

I don't really understand why your results would differ like that either, hm. They should be pretty similar. Are you sure you did the transformation correctly?

Yeah, but for some reason I can't reach the 0.93 score any more with the first approach; it's now in the same range as the second approach. Maybe it was just a rare train/test split that yielded those results!?

But good to know that I should expect the same results. I'll keep testing and will post if I continue to see anything odd.
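One thing I'll try, to rule out split luck, is evaluating against a dedicated held-out dataset instead of the random 20% split. If I read the train recipe's options right, --eval-id takes a dataset to evaluate on (outlook_eval is a hypothetical dataset name here):

❯ prodigy train textcat modified blank:en --eval-id outlook_eval

That should at least make the two runs comparable.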