Inconsistent results

I'm a bit puzzled by the results I get from a classification task. I've tried two approaches where I'd expect very similar results, but that's not the case.

First approach

I've used a binary annotation scheme, so I have text, label and answer for each example, i.e. the same label in every example but with different answers.
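Two records from my data look roughly like this (the texts here are made up, but the structure is exact):

{"text": "Management expects revenue to grow next quarter.", "label": "OUTLOOK", "answer": "accept"}
{"text": "The company was founded in 1998.", "label": "OUTLOOK", "answer": "reject"}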

❯ prodigy train textcat modified blank:en
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 0.510


=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    122.44     0.835
2    2.15       0.870
3    0.42       0.899
4    0.14       0.908
5    0.06       0.917
6    0.04       0.917
7    0.03       0.920
8    0.03       0.924
9    0.03       0.926
10   0.03       0.930

============================= ✨  Results summary =============================

Label     ROC AUC
-------   -------
OUTLOOK     0.930


Best ROC AUC   0.930
Baseline       0.510

Second approach

I've tried transforming the label into OUTLOOK or NOT_OUTLOOK based on the answer, then removing the answer from my examples and requiring exclusive label classification in training.
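The conversion itself was straightforward; a minimal sketch of what I did is below. The file names are placeholders for my exported and re-imported datasets, and srsly is the JSONL helper library that ships with spaCy:

import srsly  # JSONL read/write helpers, installed alongside spaCy

# Placeholder path: the db-out export of the binary annotations
examples = srsly.read_jsonl("modified.jsonl")

converted = []
for eg in examples:
    if eg.get("answer") == "ignore":
        continue  # skip ignored examples
    # accept -> OUTLOOK, reject -> NOT_OUTLOOK
    new_label = "OUTLOOK" if eg["answer"] == "accept" else "NOT_OUTLOOK"
    # keep only the text and the derived label, i.e. drop the answer key
    converted.append({"text": eg["text"], "label": new_label})

# Placeholder path: this file was then imported as the modified2 dataset
srsly.write_jsonl("modified2.jsonl", converted)

My results now look like this: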

❯ prodigy train textcat modified2 blank:en --textcat-exclusive
✔ Loaded model 'blank:en'
Created and merged data for 3290 total examples
Using 2632 train / 658 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
ℹ Baseline accuracy: 45.485

=========================== ✨  Training the model ===========================

#    Loss       F-Score 
--   --------   --------
1    246.91     61.949
2    4.09       69.164
3    0.83       71.414
4    0.27       74.955
5    0.12       78.210
6    0.07       81.008
7    0.06       81.369
8    0.06       81.302
9    0.06       81.942
10   0.06       81.656

============================= ✨  Results summary =============================

Label         F-Score
-----------   -------
NOT_OUTLOOK    94.319
OUTLOOK        69.565


Best F-Score   81.942
Baseline       45.485

I'm curious why I'm seeing such different results, and why one of the methods seems to give me its scores as percentages and the other doesn't (bug?). From the looks of this, the first approach is definitely the way to go, or am I reading it wrong?

Which versions of spaCy and Prodigy are you using? And if you're not on the latest, could you try again with the latest spaCy (v2.2.4) and Prodigy (v1.9.9)? We recently fixed a few training-related issues, so it's possible that those will affect your results as well.

Sorry, I forgot to put that in. I just started a clean virtual environment with Python 3.8.2:

  • spacy: 2.2.4
  • prodigy: 1.9.9

(Off topic: I don't know if you're aware, but https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz is downloading at ~30 kB/s at the moment; I'm in Copenhagen. Might be a GitHub issue, of course.)

I don't really understand why your results would differ like that either, hm. They should be pretty similar. Are you sure you did the transformation correctly?

Yeah, but for some reason I can't reach the 0.93 score any more with the first approach; it's now in the same range as the second approach. Maybe it was just a rare train/test split that yielded those results!?

But good to know that I should expect the same results. I'll keep testing and will post if I continue to see anything odd.
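One thing I'll try, to rule out split luck, is evaluating against a dedicated held-out dataset instead of the random 20% split. If I read the train recipe's options right, --eval-id takes a dataset to evaluate on (outlook_eval is a hypothetical dataset name here):

❯ prodigy train textcat modified blank:en --eval-id outlook_eval

That should at least make the two runs comparable.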