I am trying to understand what the stats reported by the ner.batch-train recipe mean and how best to use them.
Background - I want to use NER to recognize two labels, ARTIST and WORK_OF_ART, on a set of music video titles. The ARTIST label is new, but I decided to re-purpose the WORK_OF_ART label that already exists in some models. Many of the titles in the source text contain both labels, some contain only one, and some contain none at all.
I have 500 titles that I manually annotated with Prodigy (a surprisingly quick and slick task), most of them 'accept' and some 'reject' answers. I split them into two parts for training and evaluation (using a fixed split so the results can be compared between models). I wanted to run these through ner.batch-train in order to determine which model is the best choice as a starting point. The following models were tested:
- en_core_web_sm (with and without NER)
- en_core_web_md (with and without NER)
- en_core_web_lg (with and without NER)
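For reference, this is roughly how I made the fixed train/eval split (a sketch; the seed value and the example data here are hypothetical, not my exact annotations, which I exported from Prodigy):

```python
import random

# Hypothetical annotated examples; in practice these come from `prodigy db-out`.
annotations = [{"text": f"title {i}", "answer": "accept"} for i in range(500)]

# Fixed seed so every model is evaluated against the same split.
rng = random.Random(42)
shuffled = annotations[:]
rng.shuffle(shuffled)

split = len(shuffled) // 2   # roughly half for evaluation, half for training
eval_set = shuffled[:split]
train_set = shuffled[split:]

print(len(train_set), len(eval_set))  # 250 250
```

The point of the fixed seed is just determinism: re-running the script always produces the same split, so accuracy differences between models can't come from different evaluation data.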
Question #1 - Is it safe to assume that the model that gives me the best accuracy after training on the initial annotations is the best model to work with?
Running ner.batch-train on my datasets with en_core_web_sm gives:
```
Training model: en_core_web_sm...
Using 2 labels: ARTIST, WORK_OF_ART
Loaded model en_core_web_sm
Loaded 250 evaluation examples from 'eval_dataset'
Using 100% of remaining examples (240) for training
Dropout: 0.2  Batch size: 32  Iterations: 10

BEFORE     0.045
Correct    12
Incorrect  255
Entities   334
Unknown    15

#    LOSS    RIGHT  WRONG  ENTS   SKIP  ACCURACY
01   3.088   28     239    1598   0     0.105
02   3.075   56     211    1221   0     0.210
03   2.509   68     199    1524   0     0.255
04   2.192   87     180    1162   0     0.326
05   2.040   105    162    1314   0     0.393
06   1.864   107    160    1405   0     0.401
07   1.788   117    150    1286   0     0.438
08   1.659   122    145    1402   0     0.457
09   1.540   135    132    1254   0     0.506
10   1.458   128    139    1366   0     0.479

Correct    135
Incorrect  132
Baseline   0.045
Accuracy   0.506
```
Question #2 - What do the initial stats mean? Are the correct/incorrect values the number of individual entities (not examples) that the NER got right or wrong on the evaluation dataset before training? And what exactly do the Entities and Unknown values mean?
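For context, my working assumption (which may well be wrong, hence the question) is that the reported accuracy is simply entity-level correct / (correct + incorrect):

```python
# My guess at how the reported numbers are derived (an assumption, not from the docs):
correct, incorrect = 135, 132                  # final counts from the run above
accuracy = correct / (correct + incorrect)
print(round(accuracy, 3))                      # 0.506, matching the reported Accuracy

before_correct, before_incorrect = 12, 255     # pre-training counts from the run above
baseline = before_correct / (before_correct + before_incorrect)
print(round(baseline, 3))                      # 0.045, matching the BEFORE line
```

Both computed values match the recipe's output, but that still leaves me unsure what Entities and Unknown count.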
Now, if I run ner.batch-train on en_core_web_sm with NER disabled, or even on a blank model, I get something like this:
```
BEFORE     0.034
Correct    9
Incorrect  258
Entities   534
Unknown    525
```
Question #3 - How can it get any correct answers on NER labels with a blank model, or a model without an NER component?
Question #4 - The idea was to do an initial set of annotations manually and then use ner.teach to continue collecting examples. But even with the model that had the highest accuracy after initial training (en_core_web_md with NER disabled, accuracy ~0.65), ner.teach gives mostly irrelevant and meaningless suggestions (over ~90% reject rate), and even after more than 1000 answers its suggestions haven't improved. By contrast, using ner.make-gold and manually correcting the labels gives much better results. Does that make sense? Why?
Question #5 - Just to reassure myself that I'm using the right approach, the plan was:
- Start with an initial set of manually annotated examples
- Train a model with those annotations (using the base model with the best accuracy)
- Use ner.teach with that model to continue adding answers (until…?)
- Combine all collected annotations (manual and from ner.teach) and train a model until I get reasonable results
Does the above make sense? And generally, what accuracy can I hope to reach in this scenario?
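For the "combine all collected annotations" step, I was planning something roughly like the sketch below (the example records are hypothetical, and I assume prodigy db-merge does something similar, only properly):

```python
# Hypothetical annotations from the two sources.
manual = [{"text": "Artist A - Song B", "answer": "accept"}]
from_teach = [
    {"text": "Artist A - Song B", "answer": "accept"},  # duplicate of a manual one
    {"text": "Artist C - Song D", "answer": "reject"},
]

# Naive merge with de-duplication by text (assumption: one annotation per title).
seen = set()
combined = []
for example in manual + from_teach:
    if example["text"] not in seen:
        seen.add(example["text"])
        combined.append(example)

print(len(combined))  # 2
```

If there's a recommended way to merge datasets while resolving conflicting answers for the same title, I'd be happy to hear it.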