How do I improve an individual label in textcat?

pl6306 · February 8, 2020, 5:29pm

What is the best way to improve the accuracy for two labels, mergers_acquisitions and share_repurchase? Do I need to feed it more annotations? I already have over 13K. Or is there a way for me to correct it?

Baseline accuracy: 0.472

=========================== Training the model ===========================

#    Loss      F-Score

1    571.96    0.822
2    0.54      0.822
3    0.54      0.823
4    0.54      0.823
5    0.53      0.824
6    0.53      0.823
7    0.52      0.818
8    0.51      0.817
9    0.52      0.805
10   0.50      0.778

============================= Results summary =============================

Label ROC AUC

share_repurchase 0.498
dividend 1.000
unknown 0.971
organic_growth 0.998
mergers_acquisitions 0.495
debt_reduction 0.984

Best ROC AUC 0.824
Baseline 0.472
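
For reading the per-label numbers above: ROC AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so a value near 0.5 means the scores for that label are no better than chance. The sketch below computes it by direct pairwise comparison. The function name and the toy scores are made up for illustration, and this is not necessarily how Prodigy computes the figure.

```python
def roc_auc(scores, labels):
    """Pairwise ROC AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: a label that is separated perfectly, and one that gets the
# same score for every text (which is what chance level looks like).
perfect = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
chance = roc_auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Here `perfect` comes out at 1.0 and `chance` at 0.5, which matches the pattern in the summary: some labels are separated almost perfectly while share_repurchase and mergers_acquisitions sit at chance.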

pl6306 · February 8, 2020, 5:45pm

Could it be that the training dataset is unbalanced? mergers_acquisitions and share_repurchase make up a smaller proportion of the data than the other labels. Does the training set have to be balanced for the accuracy to be high?
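
A quick way to check the imbalance is to count examples per label, and one simple way to even it out is to downsample every label to the size of the rarest one. This is a minimal sketch, assuming each annotation is a dict with a "label" key; the helper names and the toy examples are hypothetical, and downsampling is only one of several possible balancing strategies.

```python
import random
from collections import Counter, defaultdict


def label_counts(examples):
    """Count how many examples carry each label."""
    return Counter(eg["label"] for eg in examples)


def downsample(examples, seed=0):
    """Keep an equal number of examples per label (the size of the rarest)."""
    by_label = defaultdict(list)
    for eg in examples:
        by_label[eg["label"]].append(eg)
    n = min(len(items) for items in by_label.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, n))
    rng.shuffle(balanced)
    return balanced


# Toy data standing in for the real annotations.
examples = (
    [{"text": f"dividend {i}", "label": "dividend"} for i in range(8)]
    + [{"text": f"buyback {i}", "label": "share_repurchase"} for i in range(2)]
)
counts = label_counts(examples)
balanced = downsample(examples)
balanced_counts = label_counts(balanced)
```

Downsampling throws away annotations from the frequent labels, so it is worth comparing against the unbalanced run on the same evaluation set before settling on it.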

pl6306 · February 8, 2020, 6:28pm

I guess train-curve answers it. What should be the next step? Besides balancing the training data are there any other ideas? Thanks!

Starting with model 'en_core_web_md'
Training 4 times with 25%, 50%, 75%, 100% of the data

=============================== Train curve ===============================
% Accuracy Difference

0% 0.50 baseline
25% 0.99 +0.49
50% 0.91 -0.08
75% 0.99 +0.09
100% 0.99 -0.00

✘ Accuracy decreased in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.

pl6306 · February 8, 2020, 7:38pm

I balanced the training data and removed the unknown label. The results are better. Should I explicitly model an unknown category? Or should I leave it out and implicitly treat as unknown any sentence that doesn't meet a minimum probability threshold? The ROC AUC of 1 looks suspicious to me. Any thoughts?
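
The implicit option asked about here could look roughly like the sketch below: take the highest-scoring label and fall back to "unknown" when its score is under a cutoff. It assumes the scores arrive as a dict of label to probability, in the shape of spaCy's doc.cats; the function name and the 0.5 default are hypothetical and the cutoff would need tuning on held-out data.

```python
def predict_label(cats, threshold=0.5, fallback="unknown"):
    """Return the top label, or the fallback if no score reaches the threshold."""
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= threshold else fallback


# Toy scores: one confident prediction and one where nothing clears the cutoff.
confident = predict_label({"dividend": 0.90, "share_repurchase": 0.05})
uncertain = predict_label({"dividend": 0.30, "share_repurchase": 0.20})
```

In this toy case `confident` is "dividend" and `uncertain` is "unknown".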

Created and merged data for 5205 total examples
Using 4164 train / 1041 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Baseline accuracy: 0.492

=========================== Training the model ===========================

#    Loss      F-Score

1    351.41    0.999
2    0.13      1.000
3    0.03      1.000
4    0.01      1.000
5    0.00      1.000
6    0.00      1.000
7    0.00      0.999
8    0.00      0.999
9    0.00      0.999
10   0.00      0.999

============================= Results summary =============================

Label ROC AUC

share_repurchase 1.000
dividend 1.000
mergers_acquisitions 1.000
organic_growth 1.000
debt_reduction 1.000

Best ROC AUC 1.000
Baseline 0.492

honnibal · February 11, 2020, 12:16am

Are the mergers_acquisitions and share_repurchase categories mutually exclusive, or can a text have both labels? If a text can have both, I think it's fine to have this as a two-class problem, with a probability threshold to determine whether each label should apply.

If it's not possible for a text to have both labels, I would probably do it as a three-class mutually exclusive problem. So you would introduce the third class, and the highest-scoring label would apply.
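
The two setups differ mainly in how the scores are turned into labels. A minimal sketch, assuming a dict of label to score like spaCy's doc.cats; the function names, the "other" class name, and the 0.5 cutoff are illustrative only:

```python
def labels_above_threshold(cats, threshold=0.5):
    """Labels scored independently: every label that clears the cutoff applies."""
    return sorted(label for label, score in cats.items() if score >= threshold)


def highest_scoring_label(cats):
    """Exactly one label per text: the one with the highest score wins."""
    return max(cats, key=cats.get)


# A text can carry both labels, one of them, or neither.
both = labels_above_threshold(
    {"mergers_acquisitions": 0.8, "share_repurchase": 0.7}
)
neither = labels_above_threshold(
    {"mergers_acquisitions": 0.2, "share_repurchase": 0.1}
)

# With an explicit third class, exactly one label is chosen per text.
single = highest_scoring_label(
    {"mergers_acquisitions": 0.15, "share_repurchase": 0.25, "other": 0.60}
)
```

With thresholds, "neither label" falls out as an empty result; with a third class, it has to be annotated and learned like any other label.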

Does that answer your question?
