EFgit
(Egzon Syka)
January 30, 2019, 10:45am
1
I’m trying to build a model using only annotations from ner.make-gold. The logic was:

1. annotate some gold examples
2. train a model with ner.batch-train using the --no-missing argument
3. repeat the first step with the new model, so the suggestions get better (and less manual intervention is needed)
Commands and output:
…
> prodigy ner.batch-train personal_info_gold_new prodigy_models/personal_info_gold_new2 -o prodigy_models/personal_info_gold_new3 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
BEFORE 0.652
Correct 15
Incorrect 8
Entities 17
Unknown 0
AFTER
Correct 20
Incorrect 6
Baseline 0.652
Accuracy 0.769
Added 650 annotations
> prodigy ner.batch-train personal_info_gold_new prodigy_models/personal_info_gold_new3 -o prodigy_models/personal_info_gold_new4 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
BEFORE 0.794
Correct 50
Incorrect 13
Entities 55
Unknown 0
AFTER
Correct 46
Incorrect 23
Baseline 0.794
Accuracy 0.667
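As a sanity check (my own arithmetic, not something from the Prodigy docs), the reported scores appear to line up with Correct / (Correct + Incorrect):

```python
# Assumption: the Accuracy / Baseline figures printed by ner.batch-train
# are simply Correct / (Correct + Incorrect), rounded to three decimals.
def accuracy(correct, incorrect):
    return round(correct / (correct + incorrect), 3)

print(accuracy(20, 6))   # first run, AFTER
print(accuracy(50, 13))  # second run, BEFORE
print(accuracy(46, 23))  # second run, AFTER
```

If that reading is right, the second run's BEFORE score (0.794) is just the first run's output model re-evaluated on a fresh 20% split.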
So, up to this iteration the results were getting better, then they suddenly started to get worse. What could be the problem here?
ines
(Ines Montani)
January 30, 2019, 12:15pm
2
Just to confirm: it looks like you’re using one dataset, personal_info_gold_new, and keep updating the model artifact produced in the previous step, right? Can you reproduce the same results if you always update the base model instead (e.g. en_core_web_sm, or whatever else you used)?
EFgit
(Egzon Syka)
January 30, 2019, 1:28pm
3
Yes, personal_info_gold_new is the dataset I created to save the gold data I annotate in each iteration...
I started annotating using the model that I generated with terms.train-vectors (prodigy_models/resumes_model1), like this:
prodigy ner.make-gold personal_info_gold_new prodigy_models/resumes_model1 data/jsonl/en_complete_316.jsonl --label "PERSON, EMAIL, BIRTH_DATE, PHONE_NUMBER, SOCIAL_MEDIA"
Then for batch-train I again used the same model (from which, I think, only the tokenizer is used):
prodigy ner.batch-train personal_info_gold_new prodigy_models/resumes_model1 -o prodigy_models/personal_info_gold_new --n-iter 10 --eval-split 0.2 --dropout 0.2
So, if I’m understanding correctly, you’re asking whether I can reproduce the same results if I keep the first model, prodigy_models/resumes_model1, as the base in all the iterations to come?
ines
(Ines Montani)
January 30, 2019, 1:30pm
4
Yes, exactly. Since you're always updating with the full gold dataset, the result should be the same.
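(One caveat worth keeping in mind, sketched here as a toy 80/20 split rather than Prodigy's actual internals: if --eval-split draws the held-out 20% at random on each run, the scores can shift between runs even when the annotations are identical.)

```python
import random

# Toy sketch of a random 80/20 train/eval split. This is an assumption
# about how --eval-split behaves, not Prodigy's actual code: a fresh
# random holdout per run means evaluation numbers can move between runs
# on the exact same dataset.
def split(examples, eval_split=0.2, seed=None):
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    return shuffled[n_eval:], shuffled[:n_eval]

examples = list(range(100))
train, evaluation = split(examples)
print(len(train), len(evaluation))  # 80 20
```

Fixing a seed makes the split reproducible; without one, each run evaluates on a different 20%.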
EFgit
(Egzon Syka)
January 30, 2019, 1:47pm
6
Yes, that’s what I thought too, but I got this:
prodigy ner.batch-train personal_info_gold_new prodigy_models/resumes_model1 -o prodigy_models/personal_info_gold_new4 --n-iter 10 --eval-split 0.2 --dropout 0.2 --no-missing
BEFORE 0.004
Correct 9
Incorrect 2534
Entities 2494
Unknown 0
AFTER
Correct 35
Incorrect 29
Baseline 0.004
Accuracy 0.547
and now I am confused...