PRODIGY_CONFIG_OVERRIDES settings dropout, batch_size and eval_frequency not working

I have set up PRODIGY_CONFIG_OVERRIDES as follows:

export PRODIGY_CONFIG_OVERRIDES='{"batch_size":32, "dropout":0.3, "eval_frequency":100}'

When I run training with prodigy and logging set to VERBOSE, I see:

10:39:23: INIT: Setting all logging levels to 10
10:39:23: RECIPE: Calling recipe 'train'
ℹ Using GPU: 0

========================= Generating Prodigy config =========================
ℹ Auto-generating config with spaCy
10:39:24: CONFIG: Using config from global prodigy.json

10:39:24: CONFIG: Merging config from CLI overrides
{'batch_size': 32, 'dropout': 0.30000000000000004, 'eval_frequency': 100}

However, the program then proceeds as if none of the settings were changed. The generated config.cfg file for the model also does not contain the batch_size, dropout and eval_frequency settings from my overrides.

Is there any way to actually CONFIRM that it is using these settings? I have also tried prefixing the settings with their section names (e.g. "training.eval_frequency": 100), with no change.


Michael Wade

P.S. I am only setting the eval_frequency so I can see if it is picking up my overrides.

Hi! The problem here is that the PRODIGY_CONFIG_OVERRIDES setting is intended to overwrite the Prodigy configuration, i.e. everything that's typically in your prodigy.json.
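To illustrate the merge semantics (this is just a sketch, not Prodigy's actual implementation): the env var is parsed as JSON and layered on top of the prodigy.json settings, so only keys that belong at that level take effect. Note that prodigy.json's own batch_size controls how many annotation tasks are sent to the web app at a time, which is unrelated to spaCy's training batching:

```python
import json
import os


def merged_prodigy_config(base: dict) -> dict:
    """Sketch of the override semantics: JSON from the env var
    is merged on top of the prodigy.json settings.
    Illustrative only, not Prodigy's real code."""
    overrides = json.loads(os.environ.get("PRODIGY_CONFIG_OVERRIDES", "{}"))
    return {**base, **overrides}


os.environ["PRODIGY_CONFIG_OVERRIDES"] = '{"batch_size": 32}'
config = merged_prodigy_config({"batch_size": 10, "port": 8080})
print(config["batch_size"])  # 32 -- Prodigy's annotation batch size,
                             # not spaCy's training batching
```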

In your case, it sounds like you want to overwrite spaCy training config parameters. You can do this just like you would with spacy train, i.e. by providing them as CLI overrides. For example:

prodigy train --ner foo,bar --training.batch_size 32

I went ahead and tried it using the following command line:

prodigy train --label-stats --ner drd_interrog_ner_gold,drd_interrog_ner_silver ./prod_models_silver10_trf --gpu-id 0 --verbose --base-model en_core_web_trf --training.batch_size 32

I then get the following error:

=========================== Initializing pipeline ===========================
✘ Config validation error
training -> batch_size extra fields not permitted

{'train_corpus': 'corpora.train', 'dev_corpus': '', 'seed': 0, 'gpu_allocator': None, 'dropout': 0.1, 'accumulate_gradient': 3, 'patience': 5000, 'max_epochs': 0, 'max_steps': 20000, 'eval_frequency': 1000, 'frozen_components': ['tagger', 'parser', 'attribute_ruler', 'lemmatizer'], 'before_to_disk': {'@misc': 'prodigy.todisk_cleanup.v1'}, 'annotating_components': [], 'logger': {'@loggers': 'prodigy.ConsoleLogger.v1'}, 'batch_size': 32, 'batcher': {'@batchers': 'spacy.batch_by_padded.v1', 'discard_oversize': True, 'get_length': None, 'size': 2000, 'buffer': 256}, 'optimizer': {'@optimizers': 'Adam.v1', 'beta1': 0.9, 'beta2': 0.999, 'L2_is_weight_decay': True, 'L2': 0.01, 'grad_clip': 1.0, 'use_averages': True, 'eps': 1e-08, 'learn_rate': {'@schedules': 'warmup_linear.v1', 'warmup_steps': 250, 'total_steps': 20000, 'initial_rate': 5e-05}}, 'score_weights': {'tag_acc': None, 'dep_uas': None, 'dep_las': None, 'dep_las_per_type': None, 'sents_p': None, 'sents_r': None, 'sents_f': None, 'lemma_acc': None, 'ents_f': 0.16, 'ents_p': 0.0, 'ents_r': 0.0, 'ents_per_type': None, 'speed': 0.0}}

If I don't use --training.batch_size, it goes ahead and runs.

Just as a test, I ran your command line exactly as given and got the same "extra fields not permitted" error. For clarity: I am using Prodigy 1.11.7 on Ubuntu 21.10, with Python 3.9.7 and spaCy 3.2.3.


Ah sorry, I had just copied over your overrides without actually checking that they're valid config settings. spaCy's config doesn't actually have a batch_size setting in the [training] block:

So it's expected that spaCy complains here. You can customise the batching for nlp.pipe and nlp.evaluate via nlp.batch_size, and the batching strategy during training via training.batcher.
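As a sketch of where these settings live in the generated config.cfg (the [training.batcher] values here mirror the defaults visible in the error dump above; the nlp.batch_size value is illustrative):

```ini
[nlp]
# batch size used by nlp.pipe and nlp.evaluate
batch_size = 64

[training.batcher]
# batching strategy used during training
@batchers = "spacy.batch_by_padded.v1"
size = 2000
buffer = 256
discard_oversize = true
get_length = null
```

On the command line, the equivalent dot-notation overrides would be e.g. --nlp.batch_size 64 or --training.batcher.size 2000.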