By default, Prodigy saves your annotations to a SQLite database, in a "dataset" with the name you provided in your command. It looks like you named yours ner_news_headlines.
Your SQLite database is stored in your Prodigy home directory. You can find it by running prodigy stats and looking for the Prodigy Home entry.
$ prodigy stats
============================== ✨ Prodigy Stats ==============================
Version 1.13.0
Location /Users/ryan/Downloads/my_prodigy_project/venv/lib/python3.9/site-packages/prodigy
Prodigy Home /Users/ryan/.prodigy
Platform macOS-13.4.1-arm64-arm-64bit
Python Version 3.9.17
Spacy Version 3.6.1
Database Name SQLite
Database Id sqlite
Total Datasets 96
Total Sessions 274
For me, on a Mac, it is /Users/ryan/.prodigy, which contains a file named prodigy.db.
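If you're curious what's inside that file, here's a minimal sketch that opens it read-only with Python's built-in sqlite3 module and lists its tables. The list_tables helper is just a name I made up for illustration, and the path assumes the default Prodigy Home shown above, so adjust it to whatever prodigy stats reports for you. It deliberately doesn't assume any particular table names.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted alphabetically."""
    # Open read-only via a URI so that inspecting the file can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Assumed location: the Prodigy Home directory plus prodigy.db.
    db_file = Path.home() / ".prodigy" / "prodigy.db"
    if db_file.exists():
        print(list_tables(db_file))
```

This is only for peeking. For actually getting your annotations out, the export route is the safer option.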
Yep - let me review a few options you have once you have your annotations.
The easiest way to view your annotations is to export them as a .jsonl file using the db-out recipe (see this example):
prodigy db-out ner_news_headlines > ./annotations.jsonl
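Once you have that file, a few lines of Python are enough to get a quick overview. This is a rough sketch, assuming the export follows the usual Prodigy task layout where each line is a JSON object with an "answer" field and, for NER, a "spans" list whose entries carry a "label". The summarize_annotations function is a hypothetical helper, not part of Prodigy.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_annotations(path):
    """Count answers and accepted span labels in a JSONL export."""
    answers = Counter()
    labels = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            task = json.loads(line)
            answer = task.get("answer", "missing")
            answers[answer] += 1
            # Only count entity labels from examples you accepted.
            if answer == "accept":
                for span in task.get("spans") or []:
                    labels[span.get("label", "missing")] += 1
    return answers, labels


if __name__ == "__main__":
    export = Path("annotations.jsonl")  # the file written by db-out above
    if export.exists():
        answers, labels = summarize_annotations(export)
        print("Answers:", dict(answers))
        print("Labels:", dict(labels))
```

If the label counts look very lopsided, that's often a hint to annotate more examples of the rare labels before training.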
However, that's an optional step. If you want to train a model, you have two options:
- Use prodigy train. Like this example, you'd run:
prodigy train --ner ner_news_headlines
You will probably also want to add the optional (but usually needed) argument for where to save the model, so you'd likely run:
prodigy train ./output_dir --ner ner_news_headlines
It's important to know that prodigy train is a simplified wrapper around spacy train. This is nice when you're beginning, as it uses a default config, so you don't have to worry about the details of spacy train and spaCy config files, which can be a bit complex at first.
- Alternatively, for intermediate to advanced users, you can use data-to-spacy and then run spacy train directly.
For example:
prodigy data-to-spacy ./corpus --ner ner_news_headlines
This will create a default spaCy config file, saved label data, and two spaCy binary files with your annotated data. The two binary files are for training (train) and evaluation (dev). Note that keeping a dedicated hold-out evaluation set like this is better practice, whereas prodigy train randomly partitions your data each time it runs. You can then run spacy train:
spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy
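To illustrate why a fixed hold-out set matters, here's a small sketch in plain Python. It is not how Prodigy or spaCy implement their splitting; split_once is a hypothetical helper, and the 20% evaluation fraction and the seed are values I picked for the example. The point is only that a split made once (or with a fixed seed) always evaluates on the same examples, so scores from different training runs stay comparable.

```python
import random


def split_once(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples with a fixed seed and split off a dev set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n_dev = max(1, int(len(shuffled) * eval_fraction))
    dev = shuffled[:n_dev]
    train = shuffled[n_dev:]
    return train, dev


if __name__ == "__main__":
    headlines = [f"headline {i}" for i in range(10)]
    train, dev = split_once(headlines)
    print(len(train), "training examples,", len(dev), "evaluation examples")
```

With the same seed, the dev set comes out identical on every call. Re-splitting with a fresh random state on each run means every model is scored on a slightly different set, which makes small differences between runs harder to interpret.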
Feel free to try other recipes like train-curve:
prodigy train-curve --ner ner_news_headlines
Or, once you have a new model in the folder ./output (or whatever name you gave it), you can take one of the saved models (e.g., ./output/model-best) and use it to annotate more examples, this time with a model-in-the-loop recipe (e.g., ner.correct), specifying your new model ./output/model-best instead of a blank English model.
There are a lot of possibilities and workflows you can iterate on.
We've even created this NER annotation flowchart to show you several possible paths (be sure to save the pdf and check out the hyperlinks embedded in many of the decision boxes):
I'd encourage you to keep searching through the docs and the support forum for other users' tips and suggestions. This is a good first step, but pretty soon you'll find more advanced things you can do, like setting up a spaCy project for your Prodigy workflow, as in this demo project that integrates many of the steps I mentioned above into one project.