NER prodigy train with existing model

mystuff · July 14, 2020, 9:48pm

Hello @ines
I have a model in production trained on a dataset of 1 million examples. Now I need to add an extra 10k examples, so I used ner.correct with the existing model.

Now it's time to train the model with the new dataset. I don't know how to add this new 10k dataset to the existing model, so I would like to use the commands below to build a new model from the existing one.

prodigy train ner dataset_10k en_spacy_prod-1.0 --n-iter 10 --eval-split 0.2 --dropout 0.2

spacy train en ./ner_model_2.0 train_10k eval_10k --base-model en_spacy_prod-1.0 --pipeline ner --n-iter 30

Am I on the right path with these train commands? Please correct me if I am doing something wrong.

ines · July 15, 2020, 11:15am

I think this looks correct, yes!

Do you still have the 1 million dataset, though? Because if you do, you might as well add your extra 10k examples to it and then retrain your model from scratch using the whole corpus. This might give you more reliable results and help prevent catastrophic forgetting effects etc.

mystuff · July 15, 2020, 6:22pm

Hi @ines, thanks. Yes, I do have the 1M. I will do that as well.

mystuff · August 10, 2020, 8:55am

@ines

spacy train en ./ner_v_8.0.0 dataset_1M/train dataset_1M/test/ --pipeline ner

My 1M dataset is split across multiple files under the train and test folders. Do I need to shuffle the newly annotated 10k or 20k examples into the existing 1M dataset before training, or can I split the 10k into 8k and 2k and add those to the train and test folders?

Does the line below shuffle the data from all the files?

corpus = GoldCorpus(train_path, dev_path, limit=n_examples)

honnibal · August 12, 2020, 1:54pm

You don't need to shuffle the data in. The training and evaluation data sets are shuffled separately, so the new data doesn't need to be mixed throughout your files.

You could consider only adding the 10k samples to the training set, so you can directly compare the results with the previous model. Or you could divide your new 10k into two parts, e.g. 8k words and 2k words, and use the 8k in the training data and then the 2k sample could be a separate evaluation.
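The 8k/2k split described above could be sketched roughly as follows. This is a minimal illustration rather than anything from spaCy or Prodigy: the function name `split_new_examples`, the `eval_fraction` and `seed` parameters, and the placeholder examples are all hypothetical, and the sketch splits by number of examples.

```python
import random


def split_new_examples(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the new examples and split it into (train, eval).

    Hypothetical helper for illustration only; a fixed seed keeps the
    split reproducible between runs.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Placeholder data standing in for the 10k newly annotated examples.
new_examples = [{"text": f"example {i}"} for i in range(10_000)]
train_part, eval_part = split_new_examples(new_examples)
print(len(train_part), len(eval_part))
```

The two parts can then be written out as extra files in the existing train and test folders, or the evaluation part can be kept in its own folder to score the new data separately.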

mystuff · August 13, 2020, 9:55pm

Thanks @honnibal. That's exactly what I was thinking.

mystuff · September 28, 2020, 8:34am

@ines and @honnibal,
After adding my 10k dataset and retraining, my F1 score dropped to 1%. What could be the reason? The data is very similar; I just added different variations.

ines · September 28, 2020, 6:44pm

That's super difficult to say and can have many reasons. Are you using a dedicated evaluation set? If not, and you're just holding back a random portion of the data, any variation here can cause your results to be different and makes them difficult to compare. It's also possible that your conversion introduced some inconsistencies in the data, so you'd have to run some experiments and see if you can track it down.
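One way to keep an evaluation set fixed between training runs is to assign each example to train or eval based on a hash of its text, instead of holding back a fresh random portion each time. The sketch below only illustrates that idea and is not how Prodigy or spaCy split data; `is_eval_example`, `eval_percent`, and the placeholder examples are hypothetical.

```python
import hashlib


def is_eval_example(text, eval_percent=20):
    """Return True if this text falls into the held-out evaluation set.

    Hypothetical helper: the decision depends only on the text, so an
    example keeps its assignment when more data is added later.
    """
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    return int(digest, 16) % 100 < eval_percent


# Placeholder data standing in for the annotated examples.
examples = [{"text": f"example sentence {i}"} for i in range(1000)]
train_set = [eg for eg in examples if not is_eval_example(eg["text"])]
eval_set = [eg for eg in examples if is_eval_example(eg["text"])]
print(len(train_set), len(eval_set))
```

Because the assignment never changes for a given text, scores from the old and new models are computed on the same evaluation examples and are easier to compare.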
