Retrain trained model with new dataset

How can I retrain the model with a new dataset to improve the model score?

Assuming there is new training data, you can rerun the train command to train on more data. Typically this improves the performance of the model.

Note that the "score" of the model depends on your validation set as well. If you don't have a separate validation set, Prodigy will automatically generate one during the train procedure. More information on this can be found in the docs.
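To make the effect of the split concrete, here is a minimal sketch of a random hold-out split like the one `--eval-split 0.25` asks for. This is an illustration in plain Python, not Prodigy's actual implementation; the function name `split_examples` and the fixed seed are assumptions made for the example.

```python
import random

def split_examples(examples, eval_split=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    # The first n_eval items become the evaluation set, the rest is training data.
    return shuffled[n_eval:], shuffled[:n_eval]

train, dev = split_examples(range(100), eval_split=0.25)
print(len(train), len(dev))  # 75 25
```

Because the held-out examples are drawn at random from whatever data is available, scores from runs on different data are only roughly comparable. A fixed, separate evaluation set makes that comparison cleaner.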

Can you please provide an example of the commands to run to retrain an existing model, or do I have to go back and run spans.manual from scratch? Please find below the commands I ran to create and train the model.

  • python3 -m prodigy spans.manual CV_DOBV1 blank:en ./CV_json_English_Format_4.jsonl --label DOB

python3 -m prodigy train ./CV_DOBV1 --ner CV_DOBV1 --eval-split 0.25

Now I have a new data file, "12.01.23_100_Cv_Format.jsonl", and I want to retrain the model on the new data.

You can use markdown syntax to highlight your code segments, which makes it easier to read/copy/paste on this forum.

That said, the aforementioned train command can pick up where another model left off. This can be done via --base-model.

python -m prodigy train ... --base-model <path-to-your-model>

Does this not work for you?
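In case a fuller example helps, below is a small Python sketch that assembles such a command. It assumes the new file has first been annotated into a second dataset, which I've called `CV_DOBV2` here (a made-up name), and that the earlier run saved a pipeline to `./CV_DOBV1/model-last`; the output directory `./CV_DOBV2_model` is also hypothetical. It only uses options already mentioned in this thread plus a comma-separated dataset list for `--ner`, which recent Prodigy versions accept as far as I know.

```python
import shlex

def build_train_command(datasets, base_model, output_dir, eval_split=0.25):
    """Return the argument list for a `prodigy train` call on several NER datasets."""
    return [
        "python3", "-m", "prodigy", "train", output_dir,
        "--ner", ",".join(datasets),       # old and new annotations together
        "--base-model", base_model,        # start from the previously trained pipeline
        "--eval-split", str(eval_split),
    ]

cmd = build_train_command(
    datasets=["CV_DOBV1", "CV_DOBV2"],     # CV_DOBV2 is a hypothetical new dataset
    base_model="./CV_DOBV1/model-last",
    output_dir="./CV_DOBV2_model",         # hypothetical output directory
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Note that the value after `--ner` is the name of the Prodigy dataset(s) holding the annotations, while the output directory is a separate positional argument.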


Note that you can also use a pretrained spaCy model here, which is a common starting point. You can do that via:

python -m prodigy train ... --base-model en_core_web_lg

prodigy train --ner New_model --base-model CV_DOBV1/model-last/

I got the error below:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.8/dist-packages/prodigy/__main__.py", line 61, in <module>
    controller = recipe(args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 364, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "/usr/local/lib/python3.8/dist-packages/plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "/usr/local/lib/python3.8/dist-packages/plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/prodigy/recipes/train.py", line 278, in train
    return _train(
  File "/usr/local/lib/python3.8/dist-packages/prodigy/recipes/train.py", line 198, in _train
    spacy_train(nlp, output_path, use_gpu=gpu_id, stdout=stdout)
  File "/usr/local/lib/python3.8/dist-packages/spacy/training/loop.py", line 122, in train
    raise e
  File "/usr/local/lib/python3.8/dist-packages/spacy/training/loop.py", line 105, in train
    for batch, info, is_best_checkpoint in training_step_iterator:
  File "/usr/local/lib/python3.8/dist-packages/spacy/training/loop.py", line 200, in train_while_improving
    for step, (epoch, batch) in enumerate(train_data):
  File "/usr/local/lib/python3.8/dist-packages/spacy/training/loop.py", line 316, in create_train_batches
    raise ValueError(Errors.E986)
ValueError: [E986] Could not create any training batches: check your input. Are the train and dev paths defined? Is discard_oversize set appropriately?

That's strange.

I'm wondering if there's something wrong with the trained model you're trying to improve upon.

Just to check, does this command run for you?

prodigy train --ner <your-dataset> --base-model en_core_web_md

Note that the en_core_web_md model should be downloaded beforehand, which you can do via:

python -m spacy download en_core_web_md
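Another thing that may be worth ruling out: E986 says no training batches could be created, so it can help to confirm that the dataset passed to `--ner` actually contains annotations. Below is a minimal sketch, assuming you have exported the dataset to a JSONL file with `prodigy db-out`. The file name `export.jsonl` and the function name `count_answers` are made up for the example.

```python
import json

def count_answers(path):
    """Count exported examples per value of their "answer" field."""
    counts = {}
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            answer = json.loads(line).get("answer", "missing")
            counts[answer] = counts.get(answer, 0) + 1
    return counts

# Example usage (the path is hypothetical):
# print(count_answers("export.jsonl"))
```

If the result is empty, or shows no "accept" answers, there is probably nothing to build training batches from, and the dataset name is the first thing I'd double-check.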