How to Connect DB to/from Prodigy

Hello, I am looking to connect my Postgres DB to and from Prodigy, so that all emails are sent for labeling and all labeling is properly stored for training.

Can you please share how I could do so? Happy to email / connect further to learn. Any documentation is also helpful!

Hi @wertzhayden!

Thanks for your message and welcome to the Prodigy community!

You can find out how to connect your database in the Prodigy database docs. You'll need to modify Prodigy's configuration file (prodigy.json) like so:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

In case it helps, we recently added a deployment section to the docs that shows in a bit more detail how to containerize your Prodigy session with a hosted PostgreSQL database (in our example we use Digital Ocean):

{
  "db": "postgresql",
  "host": "0.0.0.0",
  "port": 8080,
  "db_settings": {
    "postgresql": {
      "host": "db-postgres-prodigy-do-user-243383-0.b.db.ondigitalocean.com",
      "dbname": "defaultdb",
      "user": "doadmin",
      "password": "******",
      "port": 25060
    }
  }
}

If you deploy and keep your configuration in something like git, be sure to read the suggestions in those docs on using a .env file, so that sensitive secrets (e.g., the DB password) are never hard-coded in a prodigy.json that is stored in git.

While it can be a good idea to store the prodigy.json file in your git repository, you don’t want to add the database credentials to it. The password is purposefully kept out of the prodigy.json file in this example and will instead be filled in by a script that reads from an environment variable.

The host and user are also good candidates for environment variables, but we’ve kept these variables in the example to make it easier to explain the required parameters.
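As a rough illustration of the "script that reads from an environment variable" idea, here is a minimal Python sketch. It is not the script from the deployment docs: the function name render_prodigy_config, the template file, and the variable names PRODIGY_DB_PASSWORD, PRODIGY_DB_HOST and PRODIGY_DB_USER are all hypothetical and only chosen for this example.

```python
import json
import os
from pathlib import Path


def render_prodigy_config(template_path, output_path, env=os.environ):
    """Copy a prodigy.json template, filling DB secrets from environment variables.

    All environment variable names below are hypothetical; use whatever
    names your deployment platform provides.
    """
    config = json.loads(Path(template_path).read_text(encoding="utf-8"))
    pg = config.setdefault("db_settings", {}).setdefault("postgresql", {})

    # The password is required and never stored in the template.
    password = env.get("PRODIGY_DB_PASSWORD")
    if password is None:
        raise RuntimeError("PRODIGY_DB_PASSWORD is not set")
    pg["password"] = password

    # Host and user are optional overrides; template values are kept if unset.
    for key, var in (("host", "PRODIGY_DB_HOST"), ("user", "PRODIGY_DB_USER")):
        if var in env:
            pg[key] = env[var]

    Path(output_path).write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

The template (without the password) can then live in git, while the rendered prodigy.json is generated at deploy time and kept out of version control.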

Hope this helps!


Very helpful Ryan, thank you!

For next steps, if I create a remote DB and add the values to the prodigy.json file, what would I need to do in order to complete the DB connection to/from Prodigy?

Alternatively, how could I test that the labeled annotations are properly being added to the DB, and how can I pull them all for training?

"port": env("port"),
"host": "0.0.0.0",
"cors": true,
"db": "postgresql",
"db_settings": {
  "postgresql": {
    "host": env("host"),
    "dbname": env("dbname"),
    "user": "admin",
    "password": env("password")
  }
},

Prodigy should connect automatically as long as the prodigy.json with your DB parameters is in the right location. As the docs mention, Prodigy first looks in your Prodigy Home (run prodigy stats to find that location). That's your "global" config file. Next, it looks in your current working directory, i.e., where you're running the prodigy command. That's your "local" config file. Anything in the "local" file overrides the "global" one. Last, you could pass your DB parameters as global overrides, which would then take precedence over both your "local" and "global" configs.
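To make the precedence order concrete, here is a small Python sketch. It only illustrates the order described above (global, then local, then overrides) and is not Prodigy's actual implementation; the function name resolve_config and the example values are made up, and merging only top-level keys is an assumption made to keep the example short.

```python
def resolve_config(global_cfg, local_cfg, overrides):
    """Later sources win: global < local < overrides (top-level keys only)."""
    merged = {}
    for source in (global_cfg, local_cfg, overrides):
        merged.update(source)
    return merged


# Hypothetical settings from each of the three sources.
global_cfg = {"db": "sqlite", "port": 8080}   # prodigy.json in Prodigy Home
local_cfg = {"db": "postgresql"}              # prodigy.json in the working directory
overrides = {"port": 9000}                    # explicit overrides

resolved = resolve_config(global_cfg, local_cfg, overrides)
print(resolved)  # {'db': 'postgresql', 'port': 9000}
```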

My recommendation is that if you expect to always use the same DB, just put those into your global config (aka your Prodigy Home directory) and it'll always be used automatically when you run any commands.

The easiest first step would be to set up your prodigy.json, then run prodigy stats. You should now see an updated database name and id.

After that, try annotating a small dummy dataset. You can then check directly whether the annotations are in your Postgres database. Alternatively, you could use the built-in database components to connect directly to the database; see these docs for details.
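If you want to rule out connection problems before involving Prodigy at all, a sketch along these lines may help. It reads the "postgresql" block from a prodigy.json and tries a trivial query. It assumes the third-party psycopg2 package is installed and that the keys in db_settings (dbname, user, password, host, port, as shown earlier in this thread) can be passed straight through as psycopg2 keyword arguments. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_postgres_settings(config_path):
    """Return the "postgresql" block of db_settings from a prodigy.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    if config.get("db") != "postgresql":
        raise ValueError('"db" is not set to "postgresql" in this config')
    return dict(config["db_settings"]["postgresql"])


def can_connect(settings):
    """Open a connection with the given settings and run SELECT 1."""
    import psycopg2  # third-party; imported lazily so loading settings works without it

    conn = psycopg2.connect(**settings)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
            return cur.fetchone() == (1,)
    finally:
        conn.close()
```

Calling can_connect(load_postgres_settings("prodigy.json")) should return True if the credentials and network access are fine; any failure here points at the database setup rather than at Prodigy.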

You can also turn on Prodigy logging, which should show some details about your database connection, to confirm that you're annotating against the database you expect.

Once the annotations are there, you can pull them out and train as you see fit. I'm not sure whether you're using Prodigy's built-in spaCy tools for training or some other framework. You could also use db-out, a Prodigy recipe for exporting your annotations.

Hope this helps!


Thank you for the info and "prodigy stats" step.

When testing that the labeled annotations are properly being added to the DB & when pulling for training, I see that there is a command "db-in" and "db-out".
Is there a way to use "db-in" and "db-out" programmatically via Python? Put another way, how would I be able to call "db-in" and "db-out" via Python?

Excellent question!

Check out our database components:

from prodigy.components.db import connect

db = connect()
all_dataset_names = db.datasets
examples = db.get_dataset_examples("my_dataset")
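Building on the two calls above, a db-out style export could be sketched roughly like this. This is not how the db-out recipe itself is implemented; the helper name export_dataset and the output file name are made up for illustration, and the function relies only on db.datasets and db.get_dataset_examples as used in the snippet above.

```python
import json


def export_dataset(db, dataset_name, output_path):
    """Write every example in a dataset to a JSONL file; return the number written."""
    if dataset_name not in db.datasets:
        raise KeyError(f"No dataset named {dataset_name!r} in the database")

    count = 0
    with open(output_path, "w", encoding="utf-8") as f:
        for example in db.get_dataset_examples(dataset_name):
            # One JSON object per line, the usual JSONL layout.
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
            count += 1
    return count


# With Prodigy installed, usage would look something like:
#   from prodigy.components.db import connect
#   n = export_dataset(connect(), "my_dataset", "annotations.jsonl")
```

The resulting JSONL file can then be fed to whatever training framework you use.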

Or, as I mentioned previously, you can see exactly what db-in and db-out do by looking at their recipes within the installed package. Go to the Location: folder shown by prodigy stats and look for the recipes/commands.py script. Hope this helps!


Ryan,

It appears that my "Location" folder does not exist on my system. I have attached a screenshot of the terminal where I searched for that folder. Is there another way to find the "recipes/commands.py" script?
Screenshot 2023-10-19 at 4.18.48 PM

Sorry, I may not have been clear. By Location folder I meant the folder shown when you run prodigy stats. For example:

$ python -m prodigy stats

============================== ✨  Prodigy Stats ==============================

Version          1.14.4                        
Location         /opt/homebrew/lib/python3.9/site-packages/prodigy
Prodigy Home     /Users/ryan/.prodigy          
Platform         macOS-14.0-arm64-arm-64bit    
Python Version   3.9.17                        
Spacy Version    3.6.0                         
Database Name    SQLite                        
Database Id      sqlite                        
Total Datasets   112                           
Total Sessions   322          

Then you can just use open in your command line to open up that folder:

$ open /opt/homebrew/lib/python3.9/site-packages/prodigy

And it'll open in Finder since you're on a Mac.
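If you'd rather find that folder from Python (which also works on systems without the open command), a generic sketch using only the standard library is below. The helper name package_dir is made up. It is demonstrated on the standard-library json package so it runs anywhere; passing "prodigy" instead should give the same path as the Location line, assuming Prodigy is installed in the active environment.

```python
import importlib.util
from pathlib import Path


def package_dir(package_name):
    """Return the directory an installed package is imported from."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(f"Cannot locate package {package_name!r}")
    # spec.origin points at the package's __init__.py; its parent is the folder.
    return Path(spec.origin).parent


print(package_dir("json"))

# For Prodigy, the recipes file mentioned above would then be at:
#   package_dir("prodigy") / "recipes" / "commands.py"
```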