Adding more gold annotations decreases the accuracy of the gold model

I’m trying to build a model using only annotations from ner.make-gold. The logic was:

  • annotate some examples with ner.make-gold
  • train a model with ner.batch-train using the --no-missing argument (see the short sketch after this list)
  • then repeat the first step with the new model, so the suggestions get better (less manual intervention needed)

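To make sure I understand what --no-missing asserts, here is a minimal sketch (assuming spaCy 2.x, which Prodigy uses under the hood; the sentence and labels are just made up):

import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp("John lives in Berlin")

# default: only the annotated token is known, everything else is missing ("-")
partial = GoldParse(doc, entities=["U-PERSON", "-", "-", "-"])

# with --no-missing: unannotated tokens count as explicit non-entities ("O")
complete = GoldParse(doc, entities=["U-PERSON", "O", "O", "O"])

print(partial.ner)   # ['U-PERSON', '-', '-', '-']
print(complete.ner)  # ['U-PERSON', 'O', 'O', 'O']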
Commands and output:

  • > prodigy ner.batch-train personal_info_gold_new prodigy_models/personal_info_gold_new2 -o prodigy_models/personal_info_gold_new3 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
    
            BEFORE     0.652     
            Correct    15
            Incorrect  8
            Entities   17        
            Unknown    0                                                                                           
    
            AFTER
            Correct    20
            Incorrect  6
            Baseline   0.652     
            Accuracy   0.769 
    
  • Added 650 annotations

  • > prodigy ner.batch-train personal_info_gold_new prodigy_models/personal_info_gold_new3 -o prodigy_models/personal_info_gold_new4 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
    
           BEFORE     0.794     
           Correct    50
           Incorrect  13
           Entities   55        
           Unknown    0   
    
           AFTER
           Correct    46
           Incorrect  23
           Baseline   0.794     
           Accuracy   0.667 
    

So, up to this iteration the results were getting better, then they suddenly started to get worse.
What could be the problem here?

Just to confirm: it looks like you’re using a single dataset, personal_info_gold_new, and keep updating the model artifact produced in the previous step, right? Can you reproduce the same results if you’re always updating the base model (e.g. en_core_web_sm or whatever else you used)?

Yes, personal_info_gold_new is the dataset I created to save the gold data I annotate in each iteration...

I started annotating with the model I generated via terms.train-vectors (prodigy_models/resumes_model1), like this:

 prodigy ner.make-gold personal_info_gold_new prodigy_models/resumes_model1 data/jsonl/en_complete_316.jsonl --label "PERSON, EMAIL, BIRTH_DATE, PHONE_NUMBER, SOCIAL_MEDIA"

Then for batch-train I used the same model again (from which, I think, only the tokenizer is used):

prodigy ner.batch-train personal_info_gold_new prodigy_models/resumes_model1 -o prodigy_models/personal_info_gold_new --n-iter 10 --eval-split 0.2 --dropout 0.2
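As a sanity check on that assumption, this little sketch (using the same path as above) would show whether the base model already ships an "ner" pipe or just the tokenizer and vectors:

import spacy

nlp = spacy.load("prodigy_models/resumes_model1")
print(nlp.pipe_names)           # which pipeline components the base model contains
print(nlp.vocab.vectors.shape)  # the vectors created by terms.train-vectors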

So, if I’m understanding correctly, you’re asking whether I can reproduce the same results if I stick with the first model, which in this case would be prodigy_models/resumes_model1, in all the iterations to come?

Yes, exactly. Since you're always updating with the full gold dataset, the result should be the same.
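As an aside (not something you have to do): if you want the before/after numbers to be strictly comparable across runs, one option is a fixed held-out set instead of the random --eval-split each run. If I remember correctly, ner.batch-train also accepts a dedicated evaluation dataset via --eval-id. A rough sketch, assuming the annotations were exported first with prodigy db-out personal_info_gold_new > gold.jsonl:

import json
import random

with open("gold.jsonl", encoding="utf8") as f:
    examples = [json.loads(line) for line in f]

random.Random(0).shuffle(examples)   # fixed seed, so the split is the same every run
n_eval = int(len(examples) * 0.2)
eval_set, train_set = examples[:n_eval], examples[n_eval:]

with open("gold_eval.jsonl", "w", encoding="utf8") as f:
    f.writelines(json.dumps(eg) + "\n" for eg in eval_set)
with open("gold_train.jsonl", "w", encoding="utf8") as f:
    f.writelines(json.dumps(eg) + "\n" for eg in train_set)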

Yes, that’s what I thought too, but I got this:

prodigy ner.batch-train personal_info_gold_new prodigy_models/resumes_model1 -o prodigy_models/personal_info_gold_new4 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing

BEFORE     0.004
Correct    9
Incorrect  2534
Entities   2494
Unknown    0

AFTER
Correct    35
Incorrect  29
Baseline   0.004
Accuracy   0.547

and now I am confused...