I have a model in production trained on a 1 million example dataset. Now I need to add an extra 10k examples, so I used ner.correct with the existing model.
Now it's time to train the model with the new dataset. I don't know how to attach these new 10k examples to the existing model, so I would like to use the commands below to build a new model from the existing one.
prodigy train ner dataset_10k en_spacy_prod-1.0 --n-iter 10 --eval-split 0.2 --dropout 0.2
spacy train en ./ner_model_2.0 train_10k eval_10k --base-model en_spacy_prod-1.0 --pipeline ner --n-iter 30
Am I on the correct path in terms of the train commands? Please correct me if I'm doing something wrong.
I think this looks correct, yes!
Do you still have the 1 million dataset, though? Because if you do, you might as well add your extra 10k examples to it and then retrain your model from scratch using the whole corpus. This might give you more reliable results and prevents catastrophic forgetting effects etc.
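If you export both datasets to JSONL first (e.g. with `prodigy db-out`), combining them for a from-scratch retrain can be sketched in plain Python. The file names and toy records below are placeholders, not your actual data:

```python
import json
import os
import random
import tempfile

def merge_jsonl(paths, seed=0):
    """Merge annotation records from several JSONL files and shuffle them."""
    examples = []
    for path in paths:
        with open(path, encoding="utf8") as f:
            examples.extend(json.loads(line) for line in f if line.strip())
    random.Random(seed).shuffle(examples)  # deterministic shuffle for reproducibility
    return examples

# Tiny demo with toy records; in practice the files would come from `prodigy db-out`
tmp = tempfile.mkdtemp()
for name, records in [("old.jsonl", [{"text": "a"}, {"text": "b"}]),
                      ("new.jsonl", [{"text": "c"}])]:
    with open(os.path.join(tmp, name), "w", encoding="utf8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

merged = merge_jsonl([os.path.join(tmp, "old.jsonl"), os.path.join(tmp, "new.jsonl")])
```

The merged list can then be written back out and converted for training with the usual tooling.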
Hi @ines, thanks. Yes, I do have the 1M dataset. I will do that as well.
spacy train en ./ner_v_8.0.0 dataset_1M/train dataset_1M/test/ --pipeline ner
My 1M dataset is divided into multiple files under train and test folders. Do I need to shuffle the newly annotated 10 or 20k examples into the existing 1M dataset before training, or can I split the 10k into 8k/2k and then add them to the train and test folders?
Does the line below shuffle the data from all the files?
corpus = GoldCorpus(train_path, dev_path, limit=n_examples)
You don't need to shuffle the data in yourself; it shuffles the training and evaluation datasets separately, so the new examples don't need to be mixed throughout your files.
You could consider only adding the 10k samples to the training set, so you can directly compare the results with the previous model. Or you could divide your new 10k into two parts, e.g. 8k and 2k, use the 8k in the training data, and keep the 2k sample as a separate evaluation set.
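The 8k/2k split suggested above can be sketched like this (the placeholder records stand in for your exported annotations):

```python
import random

def train_eval_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fixed evaluation portion."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_fraction)
    return examples[n_eval:], examples[:n_eval]

# placeholder records; in practice these come from your new annotations
new_10k = [{"text": f"example {i}"} for i in range(10_000)]
train, dev = train_eval_split(new_10k)  # roughly 8k train, 2k eval
```

Fixing the seed keeps the held-out 2k stable across runs, which is what makes later model comparisons meaningful.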
Thanks @honnibal. That's exactly what I was thinking.
@ines and @honnibal,
After adding my 10k dataset and re-training, my F1 score dropped to 1%. May I know what the reason could be? The dataset is almost the same; I just added different variations.
That's super difficult to say and can have many reasons. Are you using a dedicated evaluation set? If not, and you're just holding back a random portion of the data, any variation here can cause your results to be different and makes them difficult to compare. It's also possible that your conversion introduced some inconsistencies in the data, so you'd have to run some experiments and see if you can track it down.
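To track down conversion inconsistencies, one cheap first pass is a pure-Python sanity check over the annotation records. This assumes Prodigy-style records with "text" and "spans" keys, and it only catches coarse problems (out-of-bounds, overlapping, or whitespace-padded spans), not tokenizer misalignment:

```python
def check_entities(example):
    """Return a list of problems found in one annotation record.

    `example` is assumed to be a dict with "text" and "spans"
    (each span having "start", "end", "label"), as in Prodigy's JSONL.
    """
    text, problems = example["text"], []
    spans = sorted(example.get("spans", []), key=lambda s: s["start"])
    prev_end = 0
    for s in spans:
        start, end = s["start"], s["end"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"out-of-bounds span {start}:{end}")
            continue
        if start < prev_end:
            problems.append(f"overlapping span {start}:{end}")
        if text[start:end] != text[start:end].strip():
            problems.append(f"whitespace at span edge {start}:{end}")
        prev_end = max(prev_end, end)
    return problems

good = {"text": "Apple hired Sam", "spans": [{"start": 0, "end": 5, "label": "ORG"}]}
bad = {"text": "Apple hired Sam", "spans": [{"start": 0, "end": 6, "label": "ORG"}]}
```

Records that pass this check can still be misaligned with the tokenizer, so a stricter follow-up is to run the offsets through spaCy's gold-standard alignment and look for entities it can't map to tokens.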