How do I improve an individual label in textcat?

What is the best way to improve the accuracy for two labels, mergers_acquisitions and share_repurchase? Do I need to feed it more annotations? I already have over 13K annotations. Or is there a way for me to correct it?

:information_source: Baseline accuracy: 0.472

=========================== :sparkles: Training the model ===========================

Loss F-Score


1 571.96 0.822
2 0.54 0.822
3 0.54 0.823
4 0.54 0.823
5 0.53 0.824
6 0.53 0.823
7 0.52 0.818
8 0.51 0.817
9 0.52 0.805
10 0.50 0.778

============================= :sparkles: Results summary =============================

Label ROC AUC


share_repurchase 0.498
dividend 1.000
unknown 0.971
organic_growth 0.998
mergers_acquisitions 0.495
debt_reduction 0.984

Best ROC AUC 0.824
Baseline 0.472

Could it be that the training dataset is unbalanced? mergers_acquisitions and share_repurchase make up a smaller proportion of the data than the other labels. Does the training set have to be balanced for the accuracy to be high?
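One quick way to check the balance is to count how often each label occurs among the accepted annotations. This is only a minimal sketch: it assumes each annotation is a dict with "label" and "answer" keys, and both the `label_distribution` helper and the sample texts are hypothetical, made up for illustration.

```python
from collections import Counter


def label_distribution(examples):
    """Return each label's share of the accepted examples (hypothetical helper)."""
    counts = Counter(
        eg["label"] for eg in examples if eg.get("answer") == "accept"
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Made-up annotations for illustration only; the real data is assumed
# to carry the same "label" and "answer" keys.
examples = [
    {"text": "The board raised the quarterly payout.", "label": "dividend", "answer": "accept"},
    {"text": "A special payout was announced.", "label": "dividend", "answer": "accept"},
    {"text": "The company will buy back stock.", "label": "share_repurchase", "answer": "accept"},
    {"text": "It agreed to acquire a rival.", "label": "mergers_acquisitions", "answer": "accept"},
    {"text": "Revenue was flat.", "label": "dividend", "answer": "reject"},
]

for label, share in sorted(label_distribution(examples).items()):
    print(f"{label}: {share:.0%}")
```

If the two weak labels turn out to be a small fraction of the total, that would support the imbalance explanation.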

I guess train-curve answers it. What should the next step be? Besides balancing the training data, are there any other ideas? Thanks!

:heavy_check_mark: Starting with model 'en_core_web_md'
Training 4 times with 25%, 50%, 75%, 100% of the data

=============================== :sparkles: Train curve ===============================
% Accuracy Difference


0% 0.50 baseline
25% 0.99 +0.49
50% 0.91 -0.08
75% 0.99 +0.09
100% 0.99 -0.00

✘ Accuracy decreased in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.

I balanced the training data and removed the unknown label. The results are better. Should I explicitly model an unknown category? Or should I leave it out and implicitly treat any sentence that doesn't meet a minimum probability threshold as unknown? The ROC AUC of 1.000 looks suspicious to me. Any thoughts?
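To make the implicit option concrete, here is a minimal sketch. It assumes the scores come as a dict mapping label to probability, as in spaCy's `doc.cats`; the `label_or_unknown` helper, the 0.5 default, and the sample scores are hypothetical and not tuned on this data.

```python
def label_or_unknown(cats, min_prob=0.5, fallback="unknown"):
    """Return the top-scoring label, or the fallback if no score reaches min_prob."""
    if not cats:
        return fallback
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score >= min_prob else fallback


# Made-up scores for illustration; in practice they would come from doc.cats.
confident = {"dividend": 0.91, "share_repurchase": 0.05, "debt_reduction": 0.04}
unsure = {"dividend": 0.30, "share_repurchase": 0.35, "debt_reduction": 0.35}

print(label_or_unknown(confident))
print(label_or_unknown(unsure))
```

The threshold would need to be chosen on held-out data that includes sentences from outside the five categories.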

Created and merged data for 5205 total examples
Using 4164 train / 1041 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
:information_source: Baseline accuracy: 0.492

=========================== :sparkles: Training the model ===========================

Loss F-Score


1 351.41 0.999
2 0.13 1.000
3 0.03 1.000
4 0.01 1.000
5 0.00 1.000
6 0.00 1.000
7 0.00 0.999
8 0.00 0.999
9 0.00 0.999
10 0.00 0.999

============================= :sparkles: Results summary =============================

Label ROC AUC


share_repurchase 1.000
dividend 1.000
mergers_acquisitions 1.000
organic_growth 1.000
debt_reduction 1.000

Best ROC AUC 1.000
Baseline 0.492

Are the mergers_acquisitions and share_repurchase categories mutually exclusive, or can a text have both labels? If a text can have both, I think it's fine to have this as a two-class problem, with a probability threshold to determine whether each label should apply.

If it's not possible for a text to have both labels, I would probably treat it as a three-class mutually exclusive problem. You would introduce a third class, and the highest-scoring label would apply.
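The two decision rules can be sketched roughly as follows. This assumes scores arrive as a dict mapping label to probability, as in spaCy's `doc.cats`; the helper names, the 0.5 threshold, the "other" class name, and the sample scores are all hypothetical.

```python
def labels_above_threshold(cats, threshold=0.5):
    """Non-exclusive case: every label whose score reaches the threshold applies."""
    return sorted(label for label, score in cats.items() if score >= threshold)


def top_label(cats):
    """Mutually exclusive case: only the highest-scoring label applies."""
    return max(cats, key=cats.get)


# Made-up scores: two independent labels, so the scores need not sum to 1.
two_class = {"mergers_acquisitions": 0.82, "share_repurchase": 0.64}
# Made-up scores: three exclusive classes, with "other" as the added third class.
three_class = {"mergers_acquisitions": 0.15, "share_repurchase": 0.10, "other": 0.75}

print(labels_above_threshold(two_class))
print(top_label(three_class))
```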

Does that answer your question?