After NER.correct, how do I train?

Hi,

I'm working on retraining the NER component of one of spaCy's Spanish models to find locations. I have finished annotating with ner.correct, and I think the annotations have been saved to the hidden .prodigy folder in my home directory. If this all makes sense, I would now like to retrain the model and save it so I can call it from a Python notebook. Does the following command make sense:

prodigy train /Users/modeedna/Desktop/Users/modeedna/.prodigy/prodigy.db es_core_news_sm

The order is prodigy, train, output directory, database with annotations, model to retrain. Am I missing anything?

On a separate note, my initial command for ner.correct was:

prodigy ner.correct ner_data es_core_news_sm /Users/modeedna/Desktop/INEGI/ProyectoNER/train_punt.jsonl --label LOC

But the file didn't save under the name ner_data, or perhaps I just can't find it. The only file I saw in the hidden folder was prodigy.db, which I believe does contain the annotations from ner.correct. If this makes sense, why didn't it save as ner_data?

Thanks for the help!

hi @ModeEdna!

If you are trying to re-train using a base-model (es_core_news_sm), you don't need to provide the path to your database. Instead, you'd provide the name of your dataset, which is where you saved your annotations in that database.

For example, since you ran:

python -m prodigy ner.correct ner_data ...

Then to train an NER model, you'd run:

python -m prodigy train output_path --ner ner_data --base-model es_core_news_sm

Where output_path is where you want to save your model.

Also, since you're using --base-model and only updating one entity type, be aware that for the other entity types (outside of LOC) you'll likely run into catastrophic forgetting: the model will tend to forget entities it hasn't been retrained on. An alternative is to train a blank model instead. You can search the forum for related posts, or I can recommend some.
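For example, a minimal sketch of training a blank Spanish pipeline instead of the pretrained base model (same output_path and ner_data names as above; the --lang flag tells Prodigy which blank language to start from):

# train a blank Spanish pipeline on the LOC annotations only
python -m prodigy train output_path --ner ner_data --lang es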

ner_data is a dataset name in the SQLite database stored in prodigy.db. For example, if you open the prodigy.db file with a SQLite viewer, you can see the underlying tables. But what's more important is that there is a dataset named ner_data that you can then use with a variety of other Prodigy recipes or components.

For example:

# export the ner_data dataset to a JSONL file
prodigy db-out ner_data > ner_data.jsonl

Or, using the database components from Python to load ner_data and print the first record:

from prodigy.components.db import connect

db = connect()                         # connect to the default prodigy.db
all_dataset_names = db.datasets        # list all dataset names in the database
examples = db.get_dataset("ner_data")  # load all examples in the ner_data dataset
print(examples[0])
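If you want to sanity-check the LOC annotations before training, here's a small sketch along the same lines (the answer/spans keys follow Prodigy's standard NER annotation format; the counting logic is just for illustration):

# count accepted examples that contain at least one LOC span
from prodigy.components.db import connect

db = connect()
examples = db.get_dataset("ner_data")
n_loc = sum(
    1
    for eg in examples
    if eg.get("answer") == "accept"
    and any(span.get("label") == "LOC" for span in eg.get("spans", []))
)
print(f"{n_loc} of {len(examples)} examples have at least one LOC span")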

Does this make sense?

In order from top to bottom...

Ok, I understand that part. I'll try it out and get back to you on my progress :slight_smile:
I wasn't aware of catastrophic forgetting, so thank you for bringing it up. My project only needs to predict LOC entities, so I think I can move forward with retraining the model using only LOC training data.

The SQLite part makes sense. Just to confirm I understood it: if I were to create more training annotations and name them TRAIN2, the TRAIN2 annotations would also be saved in the prodigy.db file. So I would have two training annotation sets within prodigy.db, and I would need to specify which one I want to use (and could use multiple if needed). Correct?

Thanks again for the help!

Correct. If you have multiple datasets -- let's say TRAIN and TRAIN2 -- you can also use the prodigy db-merge command (see docs) to combine them into one dataset.

For example:

prodigy db-merge TRAIN,TRAIN2 ner_train

Then you can run prodigy train --ner ner_train, which trains on both datasets combined.

One additional thing you may want to consider is creating two annotation datasets: one for training and one for evaluation. If you don't, when you run prodigy train --ner ner_dataset, Prodigy will randomly split ner_dataset into a train/eval partition. The problem is that each run may get a different eval split, so your results will vary slightly from run to run even though nothing has changed except the train/eval partition.

The advantage of having two annotation datasets -- one for training and one for evaluation -- is that in prodigy train you can specify both of them with the eval: prefix (see docs) by running prodigy train --ner ner_train,eval:ner_eval, where ner_train is the dataset name for your training data and ner_eval is the dataset name for your evaluation data. The nice thing is that each time you run it you'll have the same evaluation (hold-out) dataset, which makes experimenting much easier because you're comparing against the same evaluation set every time.
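Putting it together, a sketch with a placeholder output_path (drop --base-model if you decide to train from a blank pipeline instead):

prodigy train output_path --ner ner_train,eval:ner_eval --base-model es_core_news_sm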

Ok, that makes sense. I'll make sure to do so!

Last question: once the model is trained and saved to the output directory, how do I use it within a notebook? I'm assuming I should load it similarly to a spaCy model, but by pointing it at the model directory on my computer.

Thanks again for the help :slight_smile:

Yep! You can load it and run it like:

import spacy
# output_path is the output location from prodigy train
nlp = spacy.load("output_path/model-best")

doc = nlp("This is a sentence you want to test.")

for entity in doc.ents:
    print(entity.text, entity.label_)

Also, you may be interested in using an app like Streamlit to show off your model. The spacy-streamlit repo has components that let you visualize any spaCy model (e.g., en_core_web_sm or your custom model). You can run it locally or deploy it in a cloud environment so you can share it with users through a URL.
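For instance, a minimal sketch of such an app using spacy-streamlit's visualize helper (the model path and example text are placeholders):

# app.py -- run with: streamlit run app.py
import spacy_streamlit

# point this at the directory saved by prodigy train
models = ["output_path/model-best"]
spacy_streamlit.visualize(models, "Una oración de ejemplo sobre la Ciudad de México.")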

There's also a spaCy project template that uses spacy-streamlit. The project.yml file lets you write commands to organize your spaCy, Prodigy, or other scripts into a project. Or you can just take a look at the visualize.py script, which uses typer so you can run:

streamlit run scripts/visualize.py output_path/model-best

Awesome, I got it to work in the notebook. I'll take a look at the Streamlit post; it could be useful for what I'm trying to do.

I really appreciate the help and patience!
