Configure a non-SQLite database (e.g. PostgreSQL) without storing the password in prodigy.json

Is there a way to configure non-sqlite db settings without using the prodigy.json file? Currently, we do:

"db_settings": {
    "postgresql": {
        "host": "$PRODIGY_DB_HOST",
        "dbname": "$PRODIGY_DB_NAME",
        "user": "$PRODIGY_DB_USER",
        "password": "$PRODIGY_DB_PASSWORD"
    }
},

Question: Is there a way to specify the password, at least, by some other means? (I’d like to avoid pushing sensitive information to the source repository.)


Suggestion: one idea, hinted at in the example above, would be to support environment variable references like "$PRODIGY_DB_PASSWORD" in place of the actual password.

Sure, that’s no problem! What’s your preferred way of handling passwords? Environment variables?

Another thing you could do, which gives you even more flexibility, is to connect to the database in Python. Prodigy exposes a connect function which takes the database type and a dictionary of database settings as its arguments:

from prodigy.components.db import connect
db = connect('postgresql', {'dbname': 'xxx', 'user': 'xxx', 'password': 'xxx'})
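
Since the idea is to keep the password out of the repo, you could also read the credentials from environment variables at this point – a minimal sketch, assuming the PRODIGY_DB_* variables from your config above are set (those names are your own convention, not something Prodigy defines):

import os
from prodigy.components.db import connect

# read the credentials from the environment instead of hardcoding them
db = connect('postgresql', {
    'dbname': os.environ['PRODIGY_DB_NAME'],
    'user': os.environ['PRODIGY_DB_USER'],
    'password': os.environ['PRODIGY_DB_PASSWORD'],
})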

You can pass a custom DB to Prodigy as the 'db' key of the components dictionary returned by a recipe – either one created by the connect() function, or an entirely custom one that follows the same API (see the readme for the API documentation).

If you’re using one of Prodigy’s built-in recipes, you can also wrap it in a custom recipe and just overwrite the database (or execute any other code you like). Recipes are just simple Python functions that return a dictionary of components – so you can import an existing one, pass in the arguments, execute it, receive back the recipe components, overwrite the DB and return them from your custom recipe:

import prodigy
from prodigy.components.db import connect
from prodigy.recipes.ner import teach  # import the built-in ner.teach recipe

# create the custom database connection to pass along with the components
db = connect('postgresql', {'dbname': 'xxx', 'user': 'xxx', 'password': 'xxx'})

@prodigy.recipe('ner.teach.wrapper')
def ner_teach_wrapper(dataset, model, source, label=None):
    # pass in the arguments of ner.teach and get back the recipe components –
    # a dict like {'dataset': dataset, 'stream': stream} etc.
    components = teach(dataset, model, source=source, label=label)
    components['db'] = db  # overwrite the database with the one you created
    return components  # return the recipe components

Then you can run your recipe just like ner.teach:

prodigy ner.teach.wrapper my_dataset en_core_web_sm my_data.jsonl -F recipe.py

You can read more on this in the PRODIGY_README.

That’s how we’re overriding the port config right now, so this solution fits right in. Thank you very much for the detailed explanation and guidance.

@ines When you specified the ner.teach.wrapper, are you saving the code in its own .py file, or are you adding it to the original ner.py file? How does Prodigy know to call the ner.teach.wrapper you are creating? Am I missing something simple?

@adingler711 Yes, it’s the -F argument at the end: -F recipe.py. This tells Prodigy to load the recipe from the file path recipe.py. You can put multiple recipes in one file, or use one file per recipe – whatever you prefer.

ahh brilliant! I must have missed the last part of the command. Thank you, that really helps!

I know this is a bit of a late response to this topic, but I just found another method I wanted to share, in case anyone still comes across this. Somewhere in the psycopg2 documentation, they mention briefly:

Also note that the same parameters can be passed to the client library using environment variables.

Thus, the PostgreSQL credentials can be configured with the usual PGUSER and PGPASSWORD environment variables, eliminating the need to create a wrapper for each recipe or to put them in the JSON configuration :smile:
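
For example, in the shell session you start Prodigy from (placeholder values – PGUSER and PGPASSWORD are the standard libpq variable names, so psycopg2 picks them up automatically):

export PGUSER=user_name
export PGPASSWORD=user_password
prodigy ner.teach my_dataset en_core_web_sm my_data.jsonl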

@joell Ohh, thanks for sharing and digging that up, that definitely makes things a lot easier :smiley: I'll add this to the Prodigy docs as well!

This doesn't seem to work for the PGDATABASE environment variable. It looks like even with that env var correctly specified, the underlying built-in connection uses a default DB with the name 'prodigy'.

peewee.OperationalError: FATAL: database "prodigy" does not exist

It might make sense for Prodigy to probe all the common env vars for the various configurable DB engines and, when specified, use them over the defaults.

Additionally, we tried creating a custom recipe with the -F option and passing in a DB connection with the correct database via the components["db"] entry, but it also seems to revert to the default prodigy DB, and we get an exception in the log output.

Thanks for the suggestion here. We've been planning for a while to make the DB configuration easier, with more specific env vars as well as support for commonly used ones, but haven't gotten to it yet. Is there any blocker to overriding your prodigy.json config with the PRODIGY_CONFIG_OVERRIDES env var?

This variable takes the same JSON format as the prodigy.json file and will overwrite any key in it. See the docs here for more info.

e.g.

import json
import os

config_overrides = {
    "db": "postgresql",
    "db_settings": {
        "postgresql": {
            "db": "database_name",
            "user": "user_name",
            "password": "user_password",
            "host": "some_remote_host.com"
        }
    }
}

os.environ["PRODIGY_CONFIG_OVERRIDES"] = json.dumps(config_overrides)
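
If you'd rather not set it from Python, the same JSON can also be exported directly in your shell before starting Prodigy (values are placeholders, same as above):

export PRODIGY_CONFIG_OVERRIDES='{"db": "postgresql", "db_settings": {"postgresql": {"db": "database_name", "user": "user_name", "password": "user_password", "host": "some_remote_host.com"}}}'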