I'm trying to understand what the stats reported by the ner.batch-train recipe mean and how best to use them.
Background - I want to use NER to recognize two labels, ARTIST and WORK_OF_ART, in a bunch of music video titles. The ARTIST label is new, but I decided to re-purpose the WORK_OF_ART label that already exists in some models. Many of the titles in the source text contain both labels, some only one of them, and some none at all.
I have 500 titles that I manually annotated with Prodigy (a surprisingly quick and slick task), most of them "accept" and some "reject" answers. I split them into two parts for training and evaluation (I used a fixed split so the results can be compared between models). I wanted to run these through ner.batch-train in order to determine which model is the best choice as a starting point (rough commands right after the list below). The following models were tested:
- blank
- en_core_web_sm (with and without NER)
- en_core_web_md (with and without NER)
- en_core_web_lg (with and without NER)
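Each run looked roughly like this (dataset and output names are placeholders, and the exact flag names may vary slightly between Prodigy versions); the same --eval-id dataset was used for every run so the accuracies are comparable:

# same fixed evaluation set for every base model, so the results are comparable
prodigy ner.batch-train train_dataset en_core_web_sm --label ARTIST,WORK_OF_ART --eval-id eval_dataset --n-iter 10 --batch-size 32 --dropout 0.2 --output /tmp/model-sm
# repeated with en_core_web_md, en_core_web_lg, their NER-stripped copies and a blank model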
Question #1 - Is it safe to assume that the model that gives the best accuracy after training on the initial annotations is the best model to work with?
Running ner.batch-train on these with en_core_web_sm gives:
Training model: en_core_web_sm...
Using 2 labels: ARTIST, WORK_OF_ART
Loaded model en_core_web_sm
Loaded 250 evaluation examples from 'eval_dataset'
Using 100% of remaining examples (240) for training
Dropout: 0.2 Batch size: 32 Iterations: 10
BEFORE 0.045
Correct 12
Incorrect 255
Entities 334
Unknown 15
# LOSS RIGHT WRONG ENTS SKIP ACCURACY
01 3.088 28 239 1598 0 0.105
02 3.075 56 211 1221 0 0.210
03 2.509 68 199 1524 0 0.255
04 2.192 87 180 1162 0 0.326
05 2.040 105 162 1314 0 0.393
06 1.864 107 160 1405 0 0.401
07 1.788 117 150 1286 0 0.438
08 1.659 122 145 1402 0 0.457
09 1.540 135 132 1254 0 0.506
10 1.458 128 139 1366 0 0.479
Correct 135
Incorrect 132
Baseline 0.045
Accuracy 0.506
Question #2 - What do the initial stats mean? Are the Correct/Incorrect values the number of individual entities (not examples) that the NER got right/wrong on the evaluation dataset before training?
And what exactly do the Entities and Unknown values mean?
Now, if I run ner.batch-train on an en_core_web_sm model with NER disabled, or even on a blank model, I get something like this:
BEFORE 0.034
Correct 9
Incorrect 258
Entities 534
Unknown 525
Question #3 - How can it have correct answers on the NER labels with a blank model or a model without NER?
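(For reference, this is roughly how I prepared the NER-stripped and blank base models; the paths are just examples:)

# copy of en_core_web_sm with its existing NER component removed (same for md/lg)
python -c "import spacy; nlp = spacy.load('en_core_web_sm'); nlp.remove_pipe('ner'); nlp.to_disk('./en_sm_no_ner')"
# blank English model
python -c "import spacy; spacy.blank('en').to_disk('./en_blank')"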
Question #4 - The idea was to have an initial set of annotations done manually and then use ner.teach to continue collecting examples. But even with the model that had the highest accuracy after the initial training (en_core_web_md with NER disabled, accuracy ~0.65), ner.teach gives mostly irrelevant and meaningless suggestions (over ~90% reject rate), and even after more than 1,000 answers its suggestions haven't improved. On the contrary, using ner.make-gold and manually correcting the labels gives much better results. Does that make sense? Why?
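For reference, the two recipes were run roughly like this (file and dataset names are placeholders; the model path is the output of the best batch-train run above):

# active learning: accept/reject the model's suggestions
prodigy ner.teach teach_dataset /tmp/model-md titles.jsonl --label ARTIST,WORK_OF_ART
# semi-manual: correct the model's predictions by hand
prodigy ner.make-gold gold_dataset /tmp/model-md titles.jsonl --label ARTIST,WORK_OF_ART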
Question #5 - Just to make sure I'm using the right approach, the plan was:
- Collect an initial set of manual annotations
- Train a model on those annotations (keeping the base model with the best accuracy)
- Use ner.teach with that model to continue adding answers (until…)
- Combine all collected annotations (manual and from ner.teach) and train a model until I get reasonable results (rough commands below)
Does the above make sense? Generally what accuracy can I hope to reach in this scenario?
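For the last step, the rough plan for combining the datasets is (names are placeholders again, and the exact db-out/db-in usage may differ between Prodigy versions):

# export both annotation sets to JSONL and re-import them as one combined dataset
prodigy db-out manual_dataset > manual.jsonl
prodigy db-out teach_dataset > teach.jsonl
cat manual.jsonl teach.jsonl > combined.jsonl
prodigy db-in combined_dataset combined.jsonl
# final training run on the combined annotations, same fixed evaluation set
prodigy ner.batch-train combined_dataset en_core_web_md --label ARTIST,WORK_OF_ART --eval-id eval_dataset --output /tmp/final-model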
Thanks!