I built a pipeline to retrain my models using new input data.
Here is an example of improving the French model on ['ORG', 'PRODUCT', 'PER'].
When I train, I get this output:
Loaded model fr_core_news_sm
Using 20% of accept/reject examples (292) for evaluation
Using 100% of remaining examples (1172) for training
Dropout: 0.2 Batch size: 16 Iterations: 10
BEFORE 0.379
Correct 153
Incorrect 251
Entities 885
Unknown 279
01 56.563 275 129 463 0 0.681
02 53.317 296 108 428 0 0.733
03 58.704 318 86 437 0 0.787
04 53.754 327 77 468 0 0.809
05 56.246 334 70 488 0 0.827
06 54.256 339 65 539 0 0.839
07 58.559 340 64 541 0 0.842
08 74.089 338 66 672 0 0.837
09 62.757 337 67 729 0 0.834
10 63.274 345 59 1305 0 0.854
Correct 345
Incorrect 59
Baseline 0.379
Accuracy 0.854
Model: /Users/iero/models/temporary
Training data: /Users/iero/models/temporary/training.jsonl
Evaluation data: /Users/iero/models/temporary/evaluation.jsonl
First question: I looked in the /Users/iero/models/temporary directory for the information above (i.e. the accuracy). Do you keep those numbers somewhere? I would like to use them to run an eval-ab and see whether my model has improved since the last training run.
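In the meantime, here is a minimal sketch of how I could capture the numbers myself by parsing the printed output. The function name `parse_summary` is my own and hypothetical, and the sketch assumes the summary lines keep the `Name value` layout shown above. From the log, the accuracy looks consistent with correct / (correct + incorrect), e.g. 345 / (345 + 59) ≈ 0.854, but that is my reading of the numbers, not something I have confirmed.

```python
import re


def parse_summary(log_text):
    """Pull the summary values out of the printed training output.

    Assumes lines of the form 'Accuracy 0.854'; this layout is taken
    from the log above and may change between versions.
    """
    summary = {}
    for key in ("Baseline", "Accuracy", "Correct", "Incorrect"):
        # 'Correct' / 'Incorrect' are printed before and after training,
        # so keep the last occurrence (the post-training value).
        matches = re.findall(
            rf"^{key}\s+([\d.]+)\s*$", log_text, flags=re.MULTILINE
        )
        if matches:
            summary[key.lower()] = float(matches[-1])
    return summary


# Shortened copy of the output above, used only as sample input.
example_log = """BEFORE 0.379
Correct 153
Incorrect 251
Correct 345
Incorrect 59
Baseline 0.379
Accuracy 0.854
"""

summary = parse_summary(example_log)
print(summary)
```

The resulting dictionary could then be compared with the one from the previous training run.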
Second question: I use fr_core_news_sm, but in the meta.json file I see a reference to core_news_sm. Is that normal?
"author":"Explosion AI",
"notes":"Because the model is trained on Wikipedia, it may perform inconsistently on many genres, such as social media text. The NER accuracy refers to the \"silver standard\" annotations in the WikiNER corpus. Accuracy on these annotations tends to be higher than correct human annotations.",
"Sequoia Corpus (UD)",
"description":"French multi-task CNN trained on the French Sequoia (Universal Dependencies) and WikiNER corpus. Assigns context-specific token vectors, POS tags, dependency parse and named entities. Supports identification of PER, LOC, ORG and MISC entities.",
Third question: If the answer to the first question is 'no', can I update meta.json to keep this training information?
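To make the third question concrete, this is roughly what I have in mind. It is only a sketch: the helper `save_training_stats` and the `training_stats` key are names I made up, and I am assuming the model loads fine when meta.json contains an extra key, which I have not verified.

```python
import json
from pathlib import Path


def save_training_stats(model_dir, stats, key="training_stats"):
    """Add a dictionary of training statistics to the model's meta.json.

    'key' is a hypothetical custom field, not an official meta.json entry.
    """
    meta_path = Path(model_dir) / "meta.json"
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta[key] = stats  # overwrite any stats from a previous run
    meta_path.write_text(
        json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf8"
    )
    return meta
```

For the run above I would call something like `save_training_stats("/Users/iero/models/temporary", {"baseline": 0.379, "accuracy": 0.854})` right after training finishes.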