Custom database connection being reset to prodigy.json parameters

I have a Postgres instance with two databases in it, my_annotated_db and my_input_db, and a custom stream manager that watches my_input_db for new input, performs some custom operations, and then feeds the stream over to Prodigy for annotation. My prodigy.json file points to my_annotated_db as:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "host": <my_host>,
      "dbname": "my_annotated_db",
      "user": <my_username>,
      "password": <my_password>,
      "port": 5432
    }
  }
}

In my StreamManager I build a connection to the input database as:

from prodigy.components.db import connect
db_config = {
  "host": <my_host>,
  "dbname": "my_input_db",
  "user": <my_username>,
  "password": <my_password>,
  "port": 5432
}

my_input_db = connect(db_id="postgresql", db_settings=db_config)

This all seems to work fine, i.e., there are no errors. Furthermore, if I immediately check the datasets and connection information using my_input_db.db.database, my_input_db.db.connect_params, and my_input_db.datasets, everything looks correct. But when the stream goes to extract data at a later time, things have changed! It seems like Prodigy has dropped the connection and reconnected with the prodigy.json parameters. I can close the connection and reconnect, but this usually isn't sufficient.

Is there a better way to handle two connections? Is there a reason why this might be happening? Maybe some internal connection timeout (though the drop seems to be almost immediate) or some connection pool that is defaulting to prodigy.json parameters?

For now, I've worked around this by reverting to a single database and using fields in the dataset meta to distinguish input from annotated data. However, I would be curious to know what the root cause is and whether it's possible to work with multiple connections to different databases through Prodigy.

Hi @FourthPartyAI,

Apologies for the delay in response!
The root cause of what you're observing is that Prodigy's DB proxy is a global variable, which means it gets re-initialized with every call to the connect method. As a result, you can only have one active DB connection at a time.
This is the recommended peewee setup: it is meant to prevent issues with multiple open connections and to ensure that all parts of the application use the same database connection.
Please see the yellow warning box in the connect method docs for more details.
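To make the mechanism concrete, here is a minimal sketch of the peewee proxy pattern. The file names are made up for illustration; Prodigy's actual proxy lives inside its DB module, but the behavior is the same:

from peewee import DatabaseProxy, SqliteDatabase

# A proxy is a placeholder that forwards everything to whichever real
# database it was most recently initialized with.
db_proxy = DatabaseProxy()

db_proxy.initialize(SqliteDatabase("input.db"))
# Any model bound to db_proxy now talks to input.db.

db_proxy.initialize(SqliteDatabase("annotated.db"))
# The same proxy (and every model bound to it) now talks to annotated.db;
# the previous binding is simply replaced, with no error or warning.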
One way to work around it is to use the DB connections sequentially, calling .close() on the input connection once it is no longer needed. You could use a context manager to make this more explicit in code:

from contextlib import contextmanager
from prodigy.components.db import connect
from prodigy.components.stream import Stream

@contextmanager
def database_connection(db_id, settings):
    # Re-points Prodigy's global DB proxy at this database for the
    # duration of the with-block and closes the connection on exit.
    db = connect(db_id, settings)
    try:
        yield db
    finally:
        db.close()

input_db_settings = {...}  # same shape as the db_config dict above
with database_connection("postgresql", input_db_settings) as input_db:
    examples = input_db.get_dataset_examples("dataset2")
    stream = Stream.from_iterable(examples)

But this doesn't change the fundamental behavior of Prodigy's global db_proxy: each call to connect() still re-initializes the global proxy, and close() only closes the connection but doesn't "uninitialize" the proxy.
So while the context manager helps with connection cleanup, it doesn't solve the underlying limitation of only being able to have one active connection at a time; it just makes the connection lifecycle more explicit and manageable.
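In practice, sequential use means calling connect() again before each phase so the proxy points at the right database. A rough sketch, where annotated_db_settings is a hypothetical dict shaped like the db_config in your post:

with database_connection("postgresql", input_db_settings) as input_db:
    # Materialize the examples while the proxy still points at my_input_db.
    # A lazy generator would be consumed later, after the proxy has been
    # re-pointed, which matches the behavior you observed in your stream.
    examples = list(input_db.get_dataset_examples("dataset2"))

# Re-initialize the global proxy so subsequent operations (e.g. saving
# annotations) hit my_annotated_db again.
annotated_db = connect("postgresql", annotated_db_settings)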
The remaining options, as I see it, are to:

  1. work directly with peewee or another database library, but that would mean losing the benefits and functionality that Prodigy's DB layer provides.
  2. store both input and output in a single DB like you did, which is what I would recommend (see the sketch below).
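For completeness, here is a minimal sketch of that single-DB approach. The dataset name and meta fields are purely illustrative, not something Prodigy prescribes:

from prodigy.components.db import connect

# One connection, driven by prodigy.json, so the global proxy is never
# re-pointed mid-stream.
db = connect()

# Tag each example's "meta" so input and annotated data can be told apart.
examples = [
    {"text": "some raw input", "meta": {"source": "my_input"}},
]
if "combined_data" not in db:
    db.add_dataset("combined_data")
db.add_examples(examples, datasets=["combined_data"])

# Later, filter on the meta field when reading back.
inputs = [
    eg for eg in db.get_dataset_examples("combined_data")
    if eg.get("meta", {}).get("source") == "my_input"
]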