I have been using prodi.gy for almost two years and am very happy with this tool and its ongoing development! I had created a standard workflow for my purposes. After a year, I unfortunately cannot replicate this workflow anymore, so I am asking for help.
First of all, I created an annotated dataset. The first step was successful:
Then I'd like to train an NER model from scratch. I usually used this command:
(text) C:\Users\MyName>python -m prodigy train ner annotated_entities18 blank:de --output NER_18
Using CPU
✘ Invalid config override 'annotated_entities18': name should start with --
Unfortunately, this doesn't work anymore. I have also tried some modifications after reading the documentation again... but I want to replicate the old command 100%. The NER model should use the blank:de model!
Hi! I think the problem here is that the usage of the train command has changed slightly in v1.11 to support training multiple components at the same time (e.g. an NER model and a text classifier together) and to integrate with spaCy v3.
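In v1.11, datasets are passed per component via flags like --ner, and the output directory is the first positional argument. Assuming your dataset and output names stay the same as above, the equivalent of your old command should look something like this:

python -m prodigy train NER_18 --ner annotated_entities18 --lang de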
Thanks, Ines, for your quick reply! Is the trained NER model equivalent to the NER model from the previous command? I have really just used the spaCy blank:de model, no tokenizer or anything... I'm a bit confused about this, but if it's the same command, then I will continue to use it this way.
Yes, the basic setup will be the same – setting --lang de is equivalent to starting out with the blank:de language, which just includes the default German tokenizer and no components.
That said, the model you train with the latest Prodigy and spaCy v3 won't be compatible with spaCy v2.
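If you want to double-check what blank:de gives you, a quick sketch is to inspect the pipeline components directly:

python -c "import spacy; nlp = spacy.blank('de'); print(nlp.pipe_names)"

This should print an empty list, since the blank pipeline only ships the tokenizer, which isn't a pipeline component.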
Thanks a lot. One last question: I'd like to load my own NER model (which I trained in a previous iteration). I'm again confused about how the new commands work...
In that case, you can just use the --base-model argument, e.g. --base-model /path/to/your/model. In general, we'd recommend retraining a model on all annotations from scratch, rather than updating a previously trained artifact, since this will give you better and more reliable results. But the base model setting can still be useful if you want to update one of the trained pipelines provided by spaCy etc.
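Using the dataset and model names from earlier in this thread as placeholders, that could look roughly like this (the output directory name here is just an example):

python -m prodigy train ./NER_18_updated --ner annotated_entities18 --base-model ./NER_18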
The --lang parameter is meant to select a tokeniser when no base model or config are passed. I think in your case, you'd want to use the --base-model parameter to pass en_core_web_lg along.
The --init-tok2vec is something you can set from the command line, but it does need to be a parameter that's available from your configuration file. If you don't configure a custom file yourself, Prodigy will assume a config file like the one generated here. After filling in the missing params via spacy init fill-config I do see an init_tok2vec parameter. This is making me wonder if this might be one of those scenarios where the exact spelling of the parameter matters. Could you try --init_tok2vec instead of --init-tok2vec?
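For reference, in a default config that has been filled in with spacy init fill-config, the relevant entries typically look something like this (defaults shown, not your actual config):

[paths]
init_tok2vec = null

[initialize]
init_tok2vec = ${paths.init_tok2vec}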
Just to double-check: are you using a custom config here? If so, could you share the config.cfg that you're using?