Hello, I am looking to connect my Postgres DB to and from Prodigy, so that all emails are sent for labeling and all labeling is properly stored for training.
Can you please share how I could do so? Happy to email / connect further to learn. Any documentation is also helpful!
In case it helps, we recently added a deployment section to the docs that shows in more detail how to containerize your Prodigy session with a hosted PostgreSQL database (our example uses Digital Ocean).
If you deploy and keep your configuration somewhere like git, be sure to read the suggestions in those docs on using a .env file, so that sensitive secrets (e.g., the DB password) are not hard-coded into a prodigy.json that is stored on git.
While it can be a good idea to store the prodigy.json file in your git repository, you don’t want to add the database credentials to it. The password is purposefully kept out of the prodigy.json file in this example and will instead be filled in by a script that reads from an environment variable.
The host and user are also good candidates for environment variables, but we’ve kept these variables in the example to make it easier to explain the required parameters.
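To make the idea of "a script that reads from an environment variable" concrete, here is a minimal sketch. It is not the script from the deployment docs. The variable name PRODIGY_DB_PASSWORD, the helper name fill_db_password, and the host/dbname/user values are all placeholders made up for illustration, and the db / db_settings layout is assumed to match what your Prodigy version expects.

```python
import copy
import json
import os


def fill_db_password(config, env=os.environ, var="PRODIGY_DB_PASSWORD"):
    """Return a copy of config with the PostgreSQL password read from env.

    The variable name is a hypothetical choice, not something Prodigy defines.
    """
    password = env.get(var)
    if not password:
        raise RuntimeError(f"Environment variable {var} is not set")
    filled = copy.deepcopy(config)  # leave the committed template untouched
    settings = filled.setdefault("db_settings", {}).setdefault("postgresql", {})
    settings["password"] = password
    return filled


def write_prodigy_json(config, path="prodigy.json", env=os.environ):
    """Fill in the password and write the resulting config to disk."""
    with open(path, "w", encoding="utf8") as f:
        json.dump(fill_db_password(config, env=env), f, indent=2)


# Template as it might be committed to git: everything except the password.
template = {
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "host": "db.example.com",  # placeholder value
            "dbname": "prodigy",       # placeholder value
            "user": "prodigy_user",    # placeholder value
        }
    },
}
```

You would call write_prodigy_json(template) at container start-up, after the environment variable has been set from your .env file, so the password only ever exists in the generated file and never in git.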
For next steps, if I create a remote DB and add the values to prodigy.json file, what would I need to do, in order to complete the DB connection to/from Prodigy?
Alternatively, how could I test that the labeled annotations are properly being added to the DB & how can I pull them all for training?
It should do so automatically so long as your prodigy.json with your DB parameters is in the right location. As the docs mention, first Prodigy will look at your Prodigy Home (e.g., run prodigy stats to find that location). That's your "global" config file. Next, it'll look at your current working directory, i.e., where you're running the prodigy command. That's your "local" config file. The "local" will override anything in the "global". Last, you could add in your DB parameters as global overrides, which would then override your "local" and "global" configs.
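If the lookup order is easier to follow as code, this toy sketch shows only the precedence described above: global first, then local, then overrides, with later sources winning. The function resolve_config is hypothetical and is not how Prodigy implements it internally; in particular it only merges top-level keys and makes no claim about how nested settings are combined.

```python
def resolve_config(global_config, local_config, overrides):
    """Merge config sources so that later sources win over earlier ones."""
    resolved = {}
    for source in (global_config, local_config, overrides):
        resolved.update(source)
    return resolved


# Example values are made up for illustration.
global_config = {"db": "sqlite", "port": 8080}   # prodigy.json in Prodigy Home
local_config = {"db": "postgresql"}              # prodigy.json in the working directory
overrides = {"port": 9000}                       # global overrides

print(resolve_config(global_config, local_config, overrides))
# {'db': 'postgresql', 'port': 9000}
```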
My recommendation is that if you expect to always use the same DB, just put those into your global config (aka your Prodigy Home directory) and it'll always be used automatically when you run any commands.
The easiest first step would be to set up your prodigy.json and then run prodigy stats; you should now see an updated database name and id.
So perhaps just try with a dummy dataset to annotate. Then you could directly see if the annotations are in your Postgres database. Alternatively, you could use the built-in database components to connect directly to the database, see these docs for details.
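As a rough sketch of that check, the helper below counts what is stored for one dataset. The name summarize_dataset is made up; it only assumes the database object exposes a datasets attribute and a get_dataset_examples method, as the object returned by Prodigy's connect() does.

```python
from collections import Counter


def summarize_dataset(db, name):
    """Return (number of examples, counts per answer) for one dataset."""
    if name not in db.datasets:
        raise ValueError(f"Dataset {name!r} not found in the database")
    examples = db.get_dataset_examples(name)
    # Each example is a dict; annotated ones normally carry an "answer" key.
    answers = Counter(eg.get("answer", "missing") for eg in examples)
    return len(examples), dict(answers)


# Usage with a real Prodigy install (dataset name is a placeholder):
# from prodigy.components.db import connect
# print(summarize_dataset(connect(), "dummy_dataset"))
```

If the count goes up after you annotate a few examples and save, the annotations are reaching whichever database your prodigy.json points to.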
You can also turn on Prodigy logging, which should show some details about your database connection, so you can confirm which database you're annotating into.
Once it's there, you can pull out the annotations and train however you see fit. I'm not sure whether you're using Prodigy's built-in spaCy tools for training or some other framework. You could also use db-out, a Prodigy recipe that exports your annotations.
When testing that the labeled annotations are properly being added to the DB & when pulling for training, I see that there is a command "db-in" and "db-out".
Is there a way to use "db-in" and "db-out" programmatically via Python? Put another way, how would I be able to call "db-in" and "db-out" via Python?
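from prodigy.components.db import connect

# Connect using the database settings from your prodigy.json
db = connect()
# Names of all datasets in the database
all_dataset_names = db.datasets
# All annotated examples stored in the dataset "my_dataset"
examples = db.get_dataset_examples("my_dataset")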
Or, as I mentioned previously, you can see exactly what db-in and db-out do by looking at their recipes within the installed package. Look in the folder shown under Location when you run prodigy stats and find the recipes/commands.py script. Hope this helps!
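In the meantime, here is a minimal sketch of what a db-out / db-in style round trip could look like from Python. The function names export_dataset and import_dataset are made up, and this is not the code of the actual recipes, which may do more (hashing examples or filling in a default answer, for instance), so do compare against recipes/commands.py. It assumes the database object provides datasets, get_dataset_examples, add_dataset and add_examples.

```python
import json


def export_dataset(db, name, path):
    """Write every example of a dataset to a JSONL file (roughly db-out)."""
    examples = db.get_dataset_examples(name)
    with open(path, "w", encoding="utf8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")
    return len(examples)


def import_dataset(db, name, path):
    """Load examples from a JSONL file into a dataset (roughly db-in)."""
    with open(path, encoding="utf8") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    if name not in db.datasets:
        db.add_dataset(name)  # create the dataset before adding to it
    db.add_examples(examples, datasets=[name])
    return len(examples)


# Usage with a real Prodigy install (names and paths are placeholders):
# from prodigy.components.db import connect
# db = connect()
# export_dataset(db, "my_dataset", "my_dataset.jsonl")
```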
It appears that my "Location" folder does not exist on my system. I have attached the terminal output from when I searched for that folder. Is there another way to find the "recipes/commands.py" script?
Sorry, I may not have been clear. By Location folder I meant the folder shown when you run prodigy stats. For example:
$ python -m prodigy stats
============================== ✨ Prodigy Stats ==============================
Version          1.14.4
Location         /opt/homebrew/lib/python3.9/site-packages/prodigy
Prodigy Home     /Users/ryan/.prodigy
Platform         macOS-14.0-arm64-arm-64bit
Python Version   3.9.17
Spacy Version    3.6.0
Database Name    SQLite
Database Id      sqlite
Total Datasets   112
Total Sessions   322
Then you can just use open in your command line to open up that folder:
$ open /opt/homebrew/lib/python3.9/site-packages/prodigy
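If you're not on macOS, or would rather find the folder from Python, a small sketch like this should point to the same place. The helper package_dir is made up for illustration; it relies only on the standard library and assumes the package is installed in the Python environment you run it from.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the directory an installed package would be imported from."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"Package {package_name!r} is not installed")
    # spec.origin is the package's __init__ file; its parent is the package folder.
    return Path(spec.origin).parent


# With Prodigy installed, the recipes script mentioned above would be at:
# print(package_dir("prodigy") / "recipes" / "commands.py")
```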