Invalid config override 'school_data': name should start with --

I'm following this tutorial - https://www.youtube.com/watch?v=59BKHO_xBPA - and have successfully run the annotation stage (with the difference being that I used my own data rather than the Reddit data, and I called my dataset school_data rather than food_data):

prodigy ner.manual school_data blank:en texts_containing_school_stuff.txt --label SCHOOL --patterns ./school_problems/school_pattern_file.jsonl

I've just tried running the next step, which is to run the following:

prodigy train ner school_data en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2

...and I'm seeing the following error:

Invalid config override 'school_data': name should start with --

I've triple-checked that the commands I've written are the same as those in the tutorial video (except for the name of my dataset of course) and I can't see what I've done wrong. I've searched for a similar support ticket and can't find one.

In case it helps, here is a pic of the annotation UI at the end of my annotation stage, showing the prodigy version at the bottom.

Can anyone help me to correct whatever I've done wrong?

PS - Thank you prodigy team for a great tool and useful video tutorials! I find them so much more useful than documentation (for getting started with a new tool at least) :hugs:


What happens if you add the -- before the name, as suggested in the error?

I'm not an expert at all, but I do know that the Prodigy team has updated a lot of libraries and code since March 2020, when the video was recorded. Some of the command structures have probably been phased out or changed since then.

Yes, when that video was recorded, the train command was a lot less flexible and only supported training one component. We've since changed it to support more use cases and training multiple components at the same time, so the command usage has changed slightly: https://prodi.gy/docs/recipes#train

So in your case, you want to do:

prodigy train --ner school_data ./tmp_model --eval-split 0.2

To include the vectors and tok2vec weights, you can now define them all in one place in the config.cfg in spaCy v3. You can use this widget to auto-generate a config file for an NER model: https://spacy.io/usage/training#quickstart. In the [initialize] block, you can then define all settings for initialising the model, including the vectors to use and the pretrained tok2vec weights. You can then provide the config file to Prodigy via the --config argument.

[initialize]
vectors = "en_core_web_lg"
init_tok2vec = "./tok2vec_cd8_model289.bin"
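For example, assuming you've saved that config as ./config.cfg, you could then run something like:

prodigy train --ner school_data ./tmp_model --eval-split 0.2 --config ./config.cfg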

Thank you so much :pray:

Thanks @ines! prodigy train --ner school_data ./tmp_model --eval-split 0.2 worked and I got an F-score of 87.02 :smiley:

Am I right in thinking that the training algorithm uses some defaults for the vectors and init_tok2vec when I don't list those in the command or add them to the config.cfg file? If so, where can I find details on what those defaults are and how they differ from the ones used in the tutorial? Am I likely to get better results by setting those configs to the options used in the tutorial?

Also, I've had a look at the documentation about the config.cfg file, but I'm still not sure where I can find that file or whether I need to create it (and if so, where, and what I need to put in it other than the [initialize] block you recommended). Could you please help me understand what I need to do, in idiot's-guide terms?

Apologies for the super newbie questions. I've learned to program in Python for data analysis and ML, but I wouldn't say I'm a proper programmer, so there are probably some basic bits of knowledge that I ought to have but don't yet. I want to make sure I fully understand what I'm doing, as I'll be using Prodigy for my PhD project.

That definitely sounds promising :blush:

You can follow the instructions here to auto-generate a config with all settings for the pipeline you want to train: Training Pipelines & Models · spaCy Usage Documentation. You can either use the quickstart widget on the website and fill in the settings, or run spacy init config on the CLI, for example:

python -m spacy init config ./config.cfg --lang en --pipeline ner
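This saves a plain-text config.cfg in your current directory, so you don't need to create the file by hand. You can then open it in any editor and adjust the [initialize] block it contains, for example by adding the vectors and init_tok2vec settings from above.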

The config includes all settings for training and configuring your model, so you (and anyone else) will always be able to re-run the training with the exact same configuration and reproduce your results.

If you have a GPU available for training, you could also experiment with initialising your model with transformer embeddings – you'll end up with a larger model, but it might give you another nice boost in accuracy.
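One way to generate a transformer-based starter config (assuming you have spacy-transformers installed; the exact flags can vary slightly between spaCy versions) is something like:

python -m spacy init config ./config_trf.cfg --lang en --pipeline ner --gpu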


Thank you so much @ines :hugs:
