Train for NER

Hi, I am currently working on a NER model that analyzes articles written in Russian.

At this stage, I have managed to build the first working models on a small amount of data, but I have many questions about training and extending them. I know there have been several similar discussions here, but none of them helped me.

This is my basic training command: "!python -m prodigy train --ner title_ner ru_core_news_sm --eval-split 0.2"

I will try to ask all my questions in this thread:

  1. When I add “--gpu-id” to see whether training gets faster, I get the error “ValueError: GPU is not accessible. Was the library installed correctly?”. The answers in the existing topic on this are not clear to me. Do you have any notes or code I could use to check whether my setup is correct?

  2. When I add “output_dir”, I get the error “NoSuchOption: no such option: --output_dir”.
    However, when training finishes, I receive the message “:heavy_check_mark: Saved pipeline to output directory ru_core_news_sm\model-last”. Does this mean that the model is saved automatically? If I run several trainings, everything is saved to the same folder, which is inconvenient, and it is not clear how to get the results out of it.

  3. How do I save the model correctly, and how do I inspect it to see whether it annotates the data correctly? For example, I have a dataset for NER; I trained on it and got good accuracy and other scores. How can I view the model's annotations and keep working with them, for instance to show how the quantitative indicators depend on each other? Do you have any notes on how to use the data after NER training?

  4. Another example: I split the dataset into two parts (“small”, 500 records, and “large”, 10000 records).
    For the small part I 1) created annotations with “ner.manual”, 2) trained the model, and 3) improved the model's performance (ner.correct, then a second training run).
    How do I now train on the “large” dataset, given that I already have a model trained on the “small” one?

The answers to these questions may be obvious, but I have not been able to work them out yet.

Prodigy version - 1.11.7
Python Version - 3.9.7

I hope you can help. Thank you!

Hi @Vadym_Kostyuk !

  1. Do you have cupy installed? For this, you might need to double-check your cupy installation. Make sure that the CUDA toolkit installation is also correct and is compatible with your GPU.
  2. output_dir in prodigy train is a positional argument, so you cannot pass it as a keyword option like --output_dir or --output-dir. Your command only needs a minor adjustment:
python -m prodigy train path/to/model --ner <dataset> --base-model ru_core_news_sm --eval-split 0.2
  3. If you run the command above, it will write the model output to that directory. The directory will contain model-last and model-best, which can then be treated like any other spaCy model. You now have two choices on what to do with them:
  • You can load the best model similar to how you load any spaCy model, then try it out with your test data (e.g., spacy.load("path/to/model/model-best")). From this you can then perform any other linguistic analysis you'd like to try.
  • You can use ner.correct for double-checking your model, or use ner.teach to improve your annotations.
  4. If I understand your question correctly, you want to resume training with your larger set of examples. You can do this by passing the model path to the --paths.vectors parameter, which goes in the overrides positional argument of prodigy train.
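
For point 1, a quick way to see what is wrong is to check whether cupy imports and whether it can see a CUDA device. This is only a minimal sketch: gpu_status is a hypothetical helper name, not part of Prodigy or spaCy, and it only covers the cupy side of the setup.

```python
def gpu_status():
    """Return a short description of the local cupy / CUDA setup.

    gpu_status is a hypothetical helper for illustration only.
    """
    try:
        import cupy
    except Exception as exc:  # missing package or a broken CUDA library load
        return f"cupy could not be imported: {exc}"
    try:
        # Asks the CUDA runtime how many devices are visible to this process.
        device_count = cupy.cuda.runtime.getDeviceCount()
    except Exception as exc:
        return f"cupy is installed, but CUDA is not usable: {exc}"
    return f"cupy {cupy.__version__} sees {device_count} CUDA device(s)"


print(gpu_status())
```

If this reports an import error, the cupy build most likely does not match the installed CUDA toolkit, which is the situation described in point 1.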

Thanks for the quick reply. Following your advice, I have a few more questions:

  1. That helped: it turned out that I had installed the packages incorrectly.

  2. When I try your version of the command, I get an error

  3. This is how my training code looks
    If I understand correctly, instead of "ru_core_news_sm" I can specify the path to the trained model? And should the path point to the folder itself or to a specific item inside the folder?

  4. Unfortunately, I did not understand this. Can you describe this point in more detail or show code examples?
    My task is this: I first trained my labels on the manual annotations, and I am happy with how well the model learned. In the future I want to reuse what it learned for a model trained on many more text examples.

  5. After training, I need to collect the annotated data to show certain dependencies and the number of occurrences of each label. How can I do that?

Sorry for so many questions and clarifications. I just really want to understand all the details of Prodigy.

Hi @Vadym_Kostyuk ,

Sorry for only getting to this now. To answer your questions:

For #2, you forgot to include the --base-model parameter:

python -m prodigy train <output_dir> --ner title_ner --base-model ru_core_news_sm

Remember to provide the output directory first. You can find more information about prodigy train by running prodigy train --help.

For #3, yes, you can specify a path to the model. It should just be a directory. Also, check inside the output directory, as it contains both model-best and model-last.
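
To look at what a trained model predicts, a rough sketch along these lines may help. It assumes the pipeline was saved to a directory such as path/to/model/model-best (a placeholder path); entity_label_counts and load_and_count are hypothetical helper names, not Prodigy or spaCy functions.

```python
from collections import Counter


def entity_label_counts(nlp, texts):
    """Run a spaCy pipeline over texts and count the predicted entity labels."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        counts.update(ent.label_ for ent in doc.ents)
    return counts


def load_and_count(model_dir, texts):
    """Load a trained pipeline from disk and count its predicted labels.

    model_dir is assumed to be a model-best or model-last directory
    written by prodigy train.
    """
    import spacy  # imported lazily so the helper above works without spaCy

    nlp = spacy.load(model_dir)
    return entity_label_counts(nlp, texts)
```

For example, load_and_count("path/to/model/model-best", list_of_article_texts) would return a Counter mapping each label to how often the model predicted it, where list_of_article_texts stands for your own data.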

For #4, you can override settings in your config from the command line. For example:

python -m prodigy train <output_dir> --ner title_ner --base-model <model> --paths.vectors <some-model>

For #5, can you elaborate on what you mean by this? If you want to export the annotated data, you can use the db-out command.
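
If the goal is to count how often each label was annotated, something like the sketch below could work on the exported data. It assumes the dataset was exported as JSONL, one task per line with "spans" and "answer" fields, for example with python -m prodigy db-out title_ner > title_ner.jsonl. count_span_labels is a hypothetical helper, and the sample lines and labels are made up for illustration.

```python
import json
from collections import Counter


def count_span_labels(jsonl_lines, answer="accept"):
    """Count span labels across exported tasks with the given answer."""
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        task = json.loads(line)
        if task.get("answer") != answer:
            continue  # ignore rejected or skipped examples
        for span in task.get("spans") or []:
            label = span.get("label")
            if label is not None:
                counts[label] += 1
    return counts


# Made-up sample in the shape of an NER export; real data would come from the file.
sample = [
    '{"text": "Иван приехал в Москву", "spans": [{"start": 0, "end": 4, "label": "PER"}, {"start": 15, "end": 21, "label": "LOC"}], "answer": "accept"}',
    '{"text": "Мария живет в Киеве", "spans": [{"start": 0, "end": 5, "label": "PER"}], "answer": "accept"}',
    '{"text": "Пример", "spans": [{"start": 0, "end": 6, "label": "LOC"}], "answer": "reject"}',
]
print(count_span_labels(sample))
```

With a real export, the same function can be called on an open file, e.g. count_span_labels(open("title_ner.jsonl", encoding="utf8")), where the file name is whatever you exported to.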