Database performance/connection issues with a remote postgresql DB

kab · June 9, 2023, 5:24pm

Thanks for all the extra info Keoni! The 24million sessions thing is definitely a red flag and beyond the scope of what I was testing. That certainly makes some more sense as to why you have so much RAM usage.

I'll try working this amount of sessions into my testing and yes, we'll definitely have an option in v2 for returning a Python generator when querying examples from a dataset. We're updating our schema for Prodigy v2 to fix this but have also made a bunch of Database performance fixes in the newly released v1.12 alpha versions. See this post for installation:

This implements the query.count fix as well as uses Peewee's query.iterator() functionality on queries. It does gather still gather examples into a list, returning that list for now so as not to break the Database API contract.

The next alpha version we'll be releasing next week will also have improvements to inserting data that should help a lot for commands like prodigy db-in.

For deleting sessions:

Annotated examples are saved in Prodigy and linked to 2 datasets:

the main dataset configured for the annotation server (this is the name you supply as a CLI argument when starting Prodigy)
the session dataset associated with the configured session_id from using the ?session=MY_SESSION_NAME query parameter supplied in the browser or a session_id automatically created with a timestamp of when the annotation session was started.

Since the same examples get linked to both datasets, you can safely delete those sessions and you'll lose the database integrity of who annotated what but still retain the examples and the fact that they belong to a specific main dataset. If you're using named sessions, the saved example JSON should have a _session_id attr added to it which you can likely use to figure out who annotated what but that's not guaranteed to work correctly for the different feed_overlap settings.

Disclaimer out of the way, if you want to clean up these sessions you totally can:

from prodigy.components.db import connect


if __name__ == "__main__":

    db = connect()

    # This is just here to validate the main dataset exists
    dataset = db.get_dataset_by_name("your_main_dataset_name") 

    # This query might take forever to run for you given the number of sessions you have.
    # It does use Peewee query iterators though so it hopefully won't cause 
    # you to run out of memory at any point
    dataset_sessions = db.get_dataset_sessions("your_main_dataset_name")
    print("BEFORE SESSIONS LIST:", dataset_sessions)

    # dataset_sessions is a list of strs with Dataset names.
    # You first need to remove the links to all examples for this 
    # session dataset before you can delete it. 
    # If you want to clear all sessions for a dataset you can do this in a loop.
    db.unlink(dataset_sessions[0])
    
    # The session should no longer have any examples linked to it so you can safely drop it
    db.drop_dataset(dataset_sessions[0])

    # Querying again, the deleted session shouldn't be in this list anymore.
    dataset_sessions = db.get_dataset_sessions("your_main_dataset_name")
    print("AFTER SESSIONS LIST:", dataset_sessions)

Topic		Replies	Views
Slow database operations on PostgreSQL database	1	390	March 11, 2023
How to connect to a remote postgres server using db.connect? enhancement , done , database , aws	8	5732	January 12, 2018
Use AWS RDS PostGres to store annotations database	1	226	August 6, 2023
When trying to use a dataset >50k, prodigy is unable to start up? database	4	498	October 1, 2019
Error in loading a new dataset in PostgreSQL done , database	4	1532	September 15, 2017

Database performance/connection issues with a remote postgresql DB

Related topics