How to connect to a remote postgres server using db.connect?

I'm trying to connect to a remote PostgreSQL database server using the prodigy.components.db.connect method, as Ines advised in her reply to Post #108. When using prodigy.json to configure the db settings, all works well. But I'm unable to connect manually (e.g. from a custom recipe). The post is split into three parts (1) shares the (fake) database settings, (2) shows what works, (3) shows what doesn't work.

§1. Database

Suppose there is an accessible, remote PostgreSQL server with the following settings:

'host': 'db.com'
'dbname': 'prodigy'
'user': 'user'
'password': 'password'

§2. What works

When I add those settings to prodigy.json, the connection is established and I'm able to view the available datasets.

# prodigy.json
{
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "host": "db.com",
            "dbname": "prodigy",
            "user": "user",
            "password": "password"
        }
    }
}

# terminal
>>> from prodigy.components.db import connect
>>> db = connect()
>>> db.datasets
['', 'dataset_1', 'dataset_2']               # <-- Great!

§3. What does not work

When I remove those settings from prodigy.json, and repeat the same steps, something goes wrong. I can't figure out what I'm missing.

# prodigy.json
{
    # ...nothing on 'db'...
}

# terminal
>>> from prodigy.components.db import connect
>>> db = connect('postgresql', {
         'host': 'db.com',                   # <-- see (Note 1)
         'dbname': 'prodigy',
         'user': 'user',
         'password': 'password'})
>>> db.datasets
[]                                           # <-- Not great

Note 1. Ines didn't mention the host entry in her response to Post #108, but I've added it here because otherwise I don't see how connect will know about the db server address. I did try adding just the host to prodigy.json, leaving the dbname, user, password to be specified in the recipe. But no success there.

I'd dig a little more into this before posting, but the documentation is light on this topic (only PRODIGY_README.html#postgresql seems relevant, but doesn't help) and the source for what exactly db.connect does seems to be unavailable. Any help would be much appreciated.

Thanks for the details and analysis – much appreciated! (Disclaimer: We’re not exactly database experts, so sorry if this is still a bit rough. We’ve also been meaning to include the uncompiled source of Prodigy’s database module to make this a little more transparent and easier to customise.)

For PostgreSQL, the database settings can include any available psycopg2 connection parameters – which seem to also allow a port and a host. So in the best case scenario, this might even be enough to make it work with a remote server.

Another missing piece: Prodigy uses the peewee package for its ORM, so anything peewee can do, we can potentially make happen in Prodigy. The section on PostgreSQL also mentions the playhouse extension which we could integrate, if it includes additional features that are useful or necessary to make Prodigy work for different configurations. (Btw, when looking at the peewee docs, keep in mind that peewee’s connect() != Prodigy’s connect(). Maybe the naming we choose here is bad, but we liked the simplicity and then again, in the DB world, everything is called “connect” anyways…)

Prodigy’s db.connect() doesn’t actually do that much – it takes care of some basic validation, initialises the respective database and returns an instance of the database class, prodigy.components.db.Database. The Database class is currently initialised with the database ID, e.g. “postgresql”, and the database settings, and then resolves that to the respective database module provided by peewee, e.g. peewee.PostgresqlDatabase.

To be honest, I just realised that this is actually very bad default behaviour. Instead, db.Database should take the instance of the database, and all the validation, mapping of the arbitrary database ID etc. should happen in the connect() helper. This would let you plug in your own PostgresqlDatabase instance – e.g. a modified version of the peewee default, or the PostgresqlExtDatabase version provided by the playhouse package. Will fix this for the next release! :+1:


Edit: Just made some changes to the database handling, so in the upcoming version of Prodigy, you’ll be able to do something like this:

from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase('my_database', user='postgres')
db = Database(psql_db, 'postgresql', 'Custom PostgreSQL Database')

Instead of the ID and settings, the Database class will then take an instance of peewee.Database (i.e. any of the classes provided by peewee or its extension package Playhouse). The other arguments are an optional display_id and display_name, mostly used for logging. They default to 'custom' and the function/class name or repr(db) for custom databases.

1 Like

Thank you very much! Looking forwarding to the enhancement.

1 Like

Btw, in the meantime, a hacky workaround could be something along the lines of this:

from prodigy.components.db import connect, DB_PROXY
from playhouse.postgres_ext import PostgresqlExtDatabase

# connect to default SQLite database or any other DB 
# hack to get an instance of of prodigy.components.db.Database
db = connect()

# get the new, custom database, e.g. the extended Playhouse PostgreSQL
psql_db = PostgresqlExtDatabase('my_database', user='postgres')

# initialise database using DB_PROXY, an instance of peewee.Proxy()
# and overwrite the database's self.db attribute with the new one
db.db = DB_PROXY.initialize(psql_db)

If we didn’t mess anything up and it works as expected, this should give you a valid db that can be returned by your recipe.

1 Like

Thank you so much for the workaround! It worked wonderfully after installing something called an hstore extension for the database. In case it’s useful to others I’ll mention how to do that:

# connect to the database
psql -h [HOST] -U [USER] [DBNAME]  # e.g. psql -h db.com -U user prodigy

# install the extension
[DBNAME]=> create extension hstore;

I’m using an aws-provisioned PostgreSQL RDS instance with the host extension enabled. Your setup may require enabling the extension to be able to replicate the step above.

Although I was able to connect to the database with the workaround, once I implemented the changes inside a recipe (subbing the proxy db in), started getting errors like this:

# Example 1 -- during annotation
File "cython_src/prodigy/components/db.pyx"
    line 183, in prodigy.components.db.Database.add_examples
AttributeError: 'NoneType' object has no attribute 'atomic'

# Example 2 -- after exiting the session
File "cython_src/prodigy/components/db.pyx"
    line 226, in prodigy.components.db.Database.save
AttributeError: 'NoneType' object has no attribute 'commit'

I’m guessing the proxy doesn’t support the same set of methods that the default db does? Here’s my somewhat obfuscated recipe, so that we’re on the same page.

# custom_recipe.py
from prodigy.components.db import connect, DB_PROXY
from playhouse.postgres_ext import PostgresqlExtDatabase

# make the default connection
db = connect()

# create a proxy db
psql_db = PostgresqlExtDatabase('prodigy',
    user='user', password='password', host='db.com')

# sub the proxy db in
db.db = DB_PROXY.initialize(psql_db)

@recipe('custom')
def custom():
    return {
        # ...
        'db': db,
    }

Everything works wonderfully now (and there is no longer a need for the workaround). After installing the licensed 1.1.0 version, I’m able to keep the configuration in the environment:

export PRODIGY_DB_NAME='[name]'
export PRODIGY_DB_HOST='[host]'
export PRODIGY_DB_USER='[user]'
export PRODIGY_DB_PASSWORD='[pass]'

and then inside a recipe can use this to initialize the database in exactly the way you described above:

import os
from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase(
    os.getenv('PRODIGY_DB_NAME'),
    user=os.getenv('PRODIGY_DB_USER'),
    password=os.getenv('PRODIGY_DB_PASSWORD'),
    host=os.getenv('PRODIGY_DB_HOST'))

@recipe('custom')
def custom():
    # ...
    return {
        # ...
        'db': Database(psql_db)
    }

Thank you very much for the advice, for the enhancement and for the inspiring work you do.

@hunan-rostomyan Thanks a lot for updating and sharing your code – glad to hear it’s working so smoothly now with the improved Database class.

If you don’t mind, I’d love to add your example to the docs to illustrate the more advanced database usage. I’m sure this would be very useful for others as well!

Of course. Feel free to add as an example.

1 Like