What is the best way for me to improve the accuracy for these two labels, mergers_acquisitions and share_repurchase? Do I need to feed it more annotations? I have over 13K annotations already. Or is there a way for me to correct it?
Baseline accuracy: 0.472
=========================== Training the model ===========================
Could it be that the training dataset is unbalanced? The mergers_acquisitions and share_repurchase labels are lower in proportion than the other labels. Does the training set have to be balanced for the accuracy to be high?
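One way to check and correct this is to oversample the minority labels so each label is equally represented. This is a minimal sketch, assuming the data is a list of `(text, label)` pairs; `balance_by_label` is a hypothetical helper, not part of spaCy or Prodigy:

```python
import random
from collections import defaultdict

def balance_by_label(examples, seed=0):
    """Oversample minority labels (with replacement) so every label
    ends up with as many examples as the most frequent one.
    `examples` is a list of (text, label) pairs."""
    random.seed(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for label, items in by_label.items():
        balanced.extend(items)
        # top up minority classes by sampling with replacement
        balanced.extend(random.choices(items, k=target - len(items)))
    random.shuffle(balanced)
    return balanced
```

Downsampling the majority classes is the other option if you'd rather not duplicate examples; either way, keep the eval set at its natural distribution so the accuracy numbers reflect real-world data.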
✘ Accuracy decreased in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could
indicate that collecting more annotations of the same type will improve the
model further.
I balanced the training data and removed the unknown label. The results are better. Should I explicitly model an unknown category? Or leave it out and implicitly model unknown as any sentence that doesn't meet a minimum probability threshold? The ROC score of 1.0 looks suspicious to me. Any thoughts?
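The implicit-unknown option can be sketched as a small decision rule over the model's scores. Assuming a dict of label scores like spaCy's `doc.cats`, a hypothetical helper might look like:

```python
def predict_with_unknown(cats, threshold=0.5):
    """Return the highest-scoring label, or 'unknown' if no label
    clears the minimum probability threshold.
    `cats` is a dict of {label: score}, e.g. spaCy's doc.cats."""
    label, score = max(cats.items(), key=lambda kv: kv[1])
    return label if score >= threshold else "unknown"
```

For example, `predict_with_unknown({"mergers_acquisitions": 0.92, "share_repurchase": 0.05})` returns the M&A label, while scores of 0.3/0.2 fall through to "unknown". The threshold itself should be tuned on held-out data rather than fixed at 0.5.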
Created and merged data for 5205 total examples
Using 4164 train / 1041 eval (split 20%)
Component: textcat | Batch size: compounding | Dropout: 0.2 | Iterations: 10
Baseline accuracy: 0.492
=========================== Training the model ===========================
Can a text have both the mergers_acquisitions and share_repurchase labels, i.e., are the categories non-mutually-exclusive? If so, I think it's fine to keep this as a two-class problem, with a probability threshold to determine whether each label should apply.
If it's not possible for a text to have both labels, I would probably frame it as a three-class, mutually exclusive problem. So you would introduce a third class for everything else, and the highest-scoring label would apply.
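The two framings above correspond to two different decision rules over the same score dict. A minimal sketch, with hypothetical helper names:

```python
def multilabel_decision(cats, threshold=0.5):
    """Non-mutually-exclusive (two-class) case: every label whose
    score clears the threshold applies; zero, one, or both may fire."""
    return [label for label, score in cats.items() if score >= threshold]

def exclusive_decision(cats):
    """Mutually exclusive (three-class) case: exactly one label
    applies, namely the highest-scoring one."""
    return max(cats, key=cats.get)
```

In spaCy terms this is the difference between a multilabel textcat (independent sigmoid outputs) and an exclusive one (softmax outputs), so the choice also affects how the model should be configured, not just how predictions are read off.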