Commands for training an NER model in Prodigy

Hi all,

I have been using prodi.gy for almost two years and am very happy with this tool and its development! I had created a standard workflow for my use case. A year on, I unfortunately cannot replicate this workflow, so I am asking for help.

First of all, I created an annotated dataset. The first step was successful:

(text) C:\Users\MyName>python -m prodigy ner.manual annotated_entities18 blank:de datapath/file.csv --label IDENTIFICATION
Using 1 label(s): IDENTIFICATION

✨ Starting the web server at http://0.0.0.0:8080 ...
Open the app in your browser and start annotating!

✔ Saved 315 annotations to database SQLite
Dataset: annotated_entities18
Session ID: 2022-01-18_13-39-53

Then I'd like to train an NER model from scratch. I usually used this command:

(text) C:\Users\MyName>python -m prodigy train ner annotated_entities18 blank:de --output NER_18
ℹ Using CPU

✘ Invalid config override 'annotated_entities18': name should start with --

Unfortunately, this doesn't work anymore. I have also tried some modifications after reading the documentation again... but I want to replicate the old command 100%. The NER model should use the blank:de model!

I would appreciate any help :blush:

Hi! I think the problem here is that the usage of the train command has changed slightly in v1.11 to support training multiple components at the same time (e.g. an NER model and text classifier together) and to integrate with spaCy v3.

You can see the new command usage and available arguments in the docs: Built-in Recipes · Prodigy

So your training command could now look like this:

python -m prodigy train ./NER_18 --ner annotated_entities18 --lang de
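For example, if you also want to hold out part of your annotations for evaluation, you can add the --eval-split flag (the same flag that comes up later in this thread); the 0.2 here is just an illustrative value:

python -m prodigy train ./NER_18 --ner annotated_entities18 --lang de --eval-split 0.2

This trains on 80% of the examples and evaluates on the remaining 20%.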

Thanks Ines for your quick reply! Is the trained NER model equivalent to the NER model from the previous command? I really just used the spaCy blank:de model, no custom tokenizer or anything... I'm a bit confused about this, but if it's the same command, then I will continue to use it this way :blush:

Best regards

Yes, the basic setup will be the same – setting --lang de is equivalent to starting out with blank:de, which just includes the default German tokenizer and no trained components.
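If you want to double-check this, a quick sanity check (assuming spaCy v3 is installed in your environment) is to create the blank pipeline and print its components:

python -c "import spacy; nlp = spacy.blank('de'); print(nlp.pipe_names)"

This should print an empty list, because blank:de only ships with the tokenizer.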

That said, the model you train with the latest Prodigy and spaCy v3 won't be compatible with spaCy v2.
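If you're ever unsure which version you're running, spaCy's built-in info command prints the installed spaCy version along with your environment details:

python -m spacy info

There's also python -m spacy validate to check whether your installed pipelines are compatible with your spaCy version.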


Thanks a lot. One last question: I'd like to load my own NER model (which I trained in a previous iteration). I'm again confused about how the new commands work...

In that case, you can just use the --base-model argument, e.g. --base-model /path/to/your/model. In general, we'd recommend retraining a model on all annotations from scratch, rather than updating a previously trained artifact, since this will give you better and more reliable results. But the base model setting can still be useful if you want to update one of the trained pipelines provided by spaCy etc.
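For instance, assuming your earlier run wrote its output to ./NER_18, you'd typically point --base-model at the best checkpoint inside that directory (training saves both a model-best and a model-last); the NER_19 output name is just a placeholder:

python -m prodigy train ./NER_19 --ner annotated_entities18 --base-model ./NER_18/model-best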


Thank you very much Ines. This was my plan :blush:

A follow-up: where is the --init-tok2vec option now?

prodigy train ./tmp_model --ner qqs_s2v_data --lang en_core_web_lg --init-tok2vec ./model0.bin --eval-split 0.2

This gives:

 No such option: --init-tok2vec

I'm basically following your video about NER @ines

Actually, specifying en_core_web_lg does not work either :frowning: The command that at least runs is this:

prodigy train ./tmp_model --ner qqs_s2v_data --lang en --eval-split 0.2

But how do I start with en_core_web_lg and pretrained vectors then?

Hi there!

A few details about the train command.

  • The --lang parameter is meant to select a tokeniser when no base model or config is passed. In your case, you'd want to use the --base-model parameter to pass en_core_web_lg along, as shown in the sketch after this list.
  • The --init-tok2vec is something you can set from the command line, but it does need to be a parameter that's available in your configuration file. If you don't provide a custom file yourself, Prodigy will assume a config file like the one generated here. After filling in the missing params via spacy init fill-config, I do see an init_tok2vec parameter. This makes me wonder if this is one of those scenarios where the exact spelling matters. Could you try --init_tok2vec (with an underscore) instead of --init-tok2vec?
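To make the first point concrete, a minimal sketch of your command with a base model instead of --lang could look like the lines below. The --paths.init_tok2vec override in the second command is an assumption on my part: the error message earlier in this thread suggests that extra --... arguments are parsed as config overrides, and init_tok2vec lives under [paths] in the filled config, so please double-check it against your own config.cfg:

python -m prodigy train ./tmp_model --ner qqs_s2v_data --base-model en_core_web_lg --eval-split 0.2

# assumption: overriding the [paths] section of the generated config
python -m prodigy train ./tmp_model --ner qqs_s2v_data --base-model en_core_web_lg --eval-split 0.2 --paths.init_tok2vec ./model0.bin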

Just to double-check: are you using a custom config here? If so, could you share the config.cfg that you're using?
