Database performance/connection issues with a remote postgresql DB

I'm having issues with a remote postgresql database. The issues appear to be connection related. While I can connect and export data via python, when I run prodigy it just hangs.

I'm really not sure what's going on or how to debug this so I'm gonna dump all that I know.

Database:

  • PostgreSQL 13.10
  • AWS RDS db.t3.small, 30GB storage
  • Currently about 2.5GB of "data" in the database
  • When I export all the datasets to json files I get about 100MB of data

DB settings:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "{{db_name}}",
      "user": "{{db_user}}",
      "password": "{{db_password}}"
    }
  }
}

# ENV
PGHOST=dbname.etc.rds.amazonaws.com

Accessing DB
I can connect and export data fine using python, e.g.

from prodigy.components.db import connect
import json
db = connect()

for dataset in db.datasets:
    data = db.get_dataset_examples(dataset)
    if data:
        with open(f"{dataset}.json", mode='w') as writer:
            json.dump(data, writer, ensure_ascii=False, indent=4)

Now if I try to run a prodigy recipe, e.g. prodigy ner.manual new_data_set blank:en some_file.txt --label X,Y,Z (and so on), it hangs:

Using 16 label(s): NOUN, VERB, PRONOUN, EVENT, IWI-HAPU, LANGUAGE, LAW, LOC,
ORG, PERSON, TANGATA, TIME, WAKA, MAHITOI, AUA, PRODUCT
^X^C^X^CTraceback (most recent call last):
  File "/webapp/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 384, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 73, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 164, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/core.pyx", line 247, in prodigy.core.Controller.get_dataset_named_sessions
  File "/webapp/lib/python3.10/site-packages/prodigy/components/db.py", line 257, in sessions
    return [ds.name for ds in datasets]
  File "/webapp/lib/python3.10/site-packages/prodigy/components/db.py", line 257, in <listcomp>
    return [ds.name for ds in datasets]
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4543, in next
    self.cursor_wrapper.iterate()
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4463, in iterate
    result = self.process_row(row)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 7706, in process_row
    data = super(ModelObjectCursorWrapper, self).process_row(row)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 7672, in process_row
    result[attr] = converters[i](row[i])
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4707, in python_value
    return value if value is None else self.adapt(value)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4856, in adapt
    if isinstance(value, text_type):
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/webapp/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
KeyboardInterrupt

I had to interrupt the command (hence the KeyboardInterrupt above). My instance running prodigy becomes slow and eventually unresponsive, so I have to cancel the command, and Prodigy never actually runs. Here's what RDS reports while the prodigy command is running:

[Screenshots of the RDS monitoring dashboards taken at 2023-03-16 12:31 while the command was running]

Something happens! It connects, etc., but it never gets to the running stage. If I run prodigy stats I see the same behaviour.

We've never had this problem before. What's changed since "before"? More data in the database, and an upgrade to the latest Prodigy version. I've since updated the PostgreSQL version and bumped up the instance size, but that didn't resolve my issues. If I run prodigy against a local database, everything runs fine.

Update
I can connect to the database and run prodigy from another machine, which leads me to suspect something's up with the EC2 instance I'm using to run prodigy. It's an Ubuntu 20.04 LTS AMI (a variant published by AWS for machine learning). Having said that, it still takes a long time for prodigy to "start up."

Any help, even with further debugging, would be greatly appreciated.

There's a memory problem. I've tried running prodigy while connecting to a large database/dataset, as mentioned above in a few places. In one instance I only had 1GB of RAM and got a Killed message. Other topics here mention a similar problem and attribute it to not enough RAM. How much RAM do we actually need? I inspected the prodigy process running locally on my Apple M1 and this is what it reports:

That's not much memory. Then I quit the prodigy process and ran it again, and this time I watched the Real Memory Size grow and recorded it (about a minute after running the command):

At its peak, prodigy needed almost 19GB of RAM!

What are we doing wrong here? Here's the command,

prodigy ner.manual some_dataset_name blank:en some_file.txt --label PERSON,LOC,FOOBAR

That file is only 1.3MB in size.

Is it possible we've stuffed up our database? We run multiple instances of prodigy for different experiments on different ports, and people can work on those experiments with unique user sessions. Our database has about 30 different datasets. The prodigy instances ran 24/7 and a number of people labeled data. Are we meant to "merge" or "squash" data at some stage? Are there any easy automated ways to do that, e.g. squashing labels when they don't conflict? Or would the best approach be to export all labels and then start from scratch?
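For concreteness, the kind of consolidation I have in mind is roughly the following. This is an untested sketch: the dataset names are placeholders, it assumes db.add_dataset / db.add_examples behave as described in the Prodigy docs, and it assumes examples carry a _task_hash we can use for naive de-duplication.

from prodigy.components.db import connect

db = connect()
merged_name = "merged_annotations"
db.add_dataset(merged_name)  # create a fresh dataset to hold everything

seen_hashes = set()
merged = []
for dataset in ["dataset_a", "dataset_b"]:  # placeholder main dataset names
    for eg in db.get_dataset_examples(dataset):
        task_hash = eg.get("_task_hash")
        if task_hash is not None and task_hash in seen_hashes:
            continue  # naive de-dup: keep the first copy, ignore conflicts
        if task_hash is not None:
            seen_hashes.add(task_hash)
        merged.append(eg)

db.add_examples(merged, datasets=[merged_name])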

I've since created a new database to get up and running again. I'm running two recipes at the same time, both using the same dataset and source .jsonl file, which is about 8MB in size. What I observe is a huge increase in storage used on the database. I don't know much about the inner workings of DBs, but why does running two prodigy recipes do this? Could this also be related to the huge use of RAM when starting up prodigy?

Thanks for the detailed report, and sorry for the frustrations!

I'm shooting in the dark a bit here, and I'll forward the issue to someone in the team who's more across the database connectivity. But I wanted to give a quick reply in case I catch you before you're off for the night, given the time difference.

This looks suspicious: Releases · coleifer/peewee · GitHub

There was a new version of the DB library we use, Peewee, two weeks ago. Can you try installing peewee v3.15.4 and see if that helps?
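For example (adjust for however you manage your environment):

pip install peewee==3.15.4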

I've tried peewee==3.15.4 but I'm still seeing the same issues. When I try to connect to the old database, the process is Killed because it runs out of memory. The RAM requirements for running prodigy must be proportional to the amount of data we have in the database, which seems like a strange requirement to me.

In the screenshots below, prodigy consumes RAM while trying to load, until the process is killed once the machine runs out of RAM.

While updating a README I found a comment I wrote 8 months ago that provides more information about this problem. It sounds like I encountered this before when upgrading from 1.10 to 1.11, so it's possible this problem doesn't exist in 1.10.

Some notes from when I tried to upgrade to 1.11. It looks like something happens with the database that causes it to run out of space, and then after that nothing runs. Maybe we need to do some sort of prodigy database update, I have no idea. I rebooted the database and it got its space back. Perhaps when we update, we just need to run a single service so the database can do its update. I have no idea. Basically: don't update prodigy.

Troubling!

We haven't been able to reproduce this locally and we really want to get to the bottom of it. I hope we can arrange a call? Most of the team is in Europe (myself included), so the timezone overlap with NZ is about as bad as it gets. We do have someone on the team in Seattle though, so that should be better.

We'll email to try to set something up.

Hi Keoni,

I've been doing a deep dive on our database implementation since our meeting yesterday, especially looking at Postgres performance. While I haven't been able to reproduce the degree of memory consumption you were seeing in your queries (yet, at least), I did find some obvious places where memory consumption could be improved and which are clearly much less performant on Postgres than on SQLite. If you're willing to help debug a little more, that would be awesome!

Rather than making fixes and publishing a new version, I thought it would be great if you could try a few small changes in your installed Prodigy code so we can have a tighter feedback loop. The changes would all be in the prodigy/components/db.py file, which is one of the few Prodigy files not compiled with Cython, so you're free to make changes there and should see them reflected immediately in your env.

So in your virtualenv you'll have prodigy installed in a path something like /path/to/my_project/.venv/lib/python3.9/site-packages/prodigy

First change to try

In prodigy/components/db.py

The program you shared was hanging after dataset creation, when the Controller calls db.count_dataset() for each session_id associated with the current dataset.

The last 2 lines of the count_dataset method load all examples for the dataset, then check the len.

Last 2 lines of Current Code:

def count_dataset(self, name: str, session: bool = False) -> int:
    ...

    examples = query.execute()
    return len(examples)

Change last 2 lines of code to:

def count_dataset(self, name: str, session: bool = False) -> int:
    ...

    count = query.count()
    return count
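For context on why this matters: query.count() lets Postgres do the counting and only a single integer comes back over the wire, whereas query.execute() pulls every example row into memory before len() is taken. A toy illustration of the difference (not Prodigy's real schema; it uses an in-memory SQLite DB so it runs anywhere peewee is installed):

from peewee import Model, SqliteDatabase, TextField, chunked

db = SqliteDatabase(":memory:")

class Example(Model):
    content = TextField()

    class Meta:
        database = db

db.create_tables([Example])
rows = [{"content": "x" * 1000} for _ in range(5000)]
with db.atomic():
    for batch in chunked(rows, 500):
        Example.insert_many(batch).execute()

query = Example.select()

# old behaviour: every row (and its content) is fetched, then counted in Python
slow_count = len(query.execute())

# suggested behaviour: the database counts, only an integer crosses the wire
fast_count = query.count()

assert slow_count == fast_count == 5000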

I'd love to see the memory consumption output from the before and after here if you're able to provide that as well.
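If it helps, something like this psutil loop is one rough way to record that over time (a sketch, assuming psutil is installed and that "prodigy" appears in the process command line):

import time

import psutil


def watch_prodigy_memory(interval: float = 5.0) -> None:
    # Print the resident memory of any running process whose command line mentions prodigy.
    while True:
        for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmdline = " ".join(proc.info["cmdline"] or [])
            if "prodigy" in cmdline:
                rss_mb = proc.info["memory_info"].rss / 1024 ** 2
                print(f"pid={proc.info['pid']} rss={rss_mb:.1f} MB")
        time.sleep(interval)


if __name__ == "__main__":
    watch_prodigy_memory()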

Also, one thing I didn't get from you earlier that would be very helpful is the output of calling prodigy stats for the main dataset you were using.

Specifically, I'm looking for the stats for the problem dataset so you'd need to provide the name of the dataset as the first arg to the stats command:

prodigy stats your_dataset_name

This count_dataset change is something that will go into the next Prodigy version regardless, but if it fixes your startup issue by itself, we can make a patch release with this change.

Thanks!


Kia ora Kabir,

I'm back on this now.

I'm trying the suggested changes to db.py, but I'm on a slow internet connection (~10Mbps, sometimes slower), which makes what I'm doing painfully slow. But I can already see my RAM being eaten up again, meaning the problem still exists...

It looks like I was finally able to load this database, but I saw that it consumed up to 20GB of RAM, which is fine for my machine (64GB of RAM) but not practical. Once the database is loaded, the RAM usage goes back down. I think we already knew this (I'm refreshing my memory).

Whatever is happening when we load up a large dataset, it's really not very efficient, both in terms of receiving data from a remote database on a slow internet connection (10Mbps) and in terms of loading the dataset into RAM. So from a high level, I think there's a lot of room for improvement. As for what's actually causing the problems :man_shrugging:t4:. There are a lot of query.execute() calls and a lot of iterating through the results of those executes, and that must be what's taking up all the time. Is there no iterator that lets you loop through the query item by item rather than executing and then looping through the results?

prodigy stats

This prodigy stats te_ara command is incredibly slow, but that could be related to my internet connection. It did finally complete... it definitely took more than 10 minutes.

>> prodigy stats te_ara

============================== ✨  Prodigy Stats ==============================

Version          1.11.11                       
Location         /Users/keoni/tehikudev/whakairo-kupu/local/lib/python3.10/site-packages/prodigy
Prodigy Home     /Users/keoni/.prodigy         
Platform         macOS-13.4-arm64-arm-64bit    
Python Version   3.10.11                       
Database Name    PostgreSQL                    
Database Id      postgresql                    
Total Datasets   42                            
Total Sessions   24629544                      


============================== ✨  Dataset Stats ==============================

Dataset       te_ara             
Created       2021-03-03 09:11:32
Description   None               
Author        None               
Annotations   0                  
Accept        0                  
Reject        0                  
Ignore        0      


START: 2023-06-07 23:15:14.016383
>> prodigy stats
============================== ✨  Prodigy Stats ==============================

Version          1.11.11                       
Location         /Users/keoni/tehikudev/whakairo-kupu/local/lib/python3.10/site-packages/prodigy
Prodigy Home     /Users/keoni/.prodigy         
Platform         macOS-13.4-arm64-arm-64bit    
Python Version   3.10.11                       
Database Name    PostgreSQL                    
Database Id      postgresql                    
Total Datasets   42                            
Total Sessions   24629544                      

END: 2023-06-07 23:33:08.156518
DURATION: 17m 54s           

24 million sessions, that might be the problem! Why we have that many sessions I could probably answer. Can we "collapse" those sessions and tidy this up? I would like to do that, and any help doing so would be appreciated. But the problem dataset, te_ara, doesn't seem to report anything noteworthy.

Note that when I run prodigy stats without a dataset argument, it's still incredibly slow.

TODO

This is my todo list to answer your questions, but I'm waiting too long for things to run, so I might go to bed and leave this here to edit when I'm on a better connection, and to give you an update until then.

  • query.count() vs. len(query.execute())
    Show the RAM usage difference from this recommended change.
    I note we have a similar pattern in Django land, where query.count() is much more efficient.

Thanks for all the extra info, Keoni! The 24 million sessions thing is definitely a red flag and beyond the scope of what I was testing. That goes a long way towards explaining why you're seeing so much RAM usage.

I'll try working this number of sessions into my testing, and yes, we'll definitely have an option in v2 for returning a Python generator when querying examples from a dataset. We're updating our schema for Prodigy v2 to fix this, but we've also made a bunch of database performance fixes in the newly released v1.12 alpha versions. See this post for installation:

This implements the query.count fix and uses Peewee's query.iterator() functionality on queries. It does still gather examples into a list, returning that list for now so as not to break the Database API contract.
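Roughly, the shape of that change is as follows (a simplified generic-peewee sketch, not the literal Prodigy code):

from peewee import ModelSelect


def collect_examples(query: ModelSelect) -> list:
    # query.iterator() streams rows straight from the DB cursor instead of caching
    # the whole result set on the query object, so peak memory stays much lower
    # while we still return a list to keep the existing Database API contract.
    return [row for row in query.iterator()]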

The next alpha version, which we'll release next week, will also have improvements to inserting data that should help a lot with commands like prodigy db-in.

For deleting sessions:

Annotated examples are saved in Prodigy and linked to 2 datasets:

  1. the main dataset configured for the annotation server (this is the name you supply as a CLI argument when starting Prodigy)
  2. the session dataset associated with the configured session_id, either from the ?session=MY_SESSION_NAME query parameter supplied in the browser, or from a session_id automatically created with a timestamp of when the annotation session was started.

Since the same examples get linked to both datasets, you can safely delete those sessions: you'll lose the database-level record of who annotated what, but you'll still retain the examples and the fact that they belong to a specific main dataset. If you're using named sessions, the saved example JSON should have a _session_id attr added to it, which you can likely use to figure out who annotated what, but that's not guaranteed to work correctly for the different feed_overlap settings.
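If you do want to keep a record of who annotated what before deleting the session datasets, a rough sketch (assuming your examples carry the _session_id attr, and with the dataset name as a placeholder) is to group the exported examples by it:

from collections import defaultdict

from prodigy.components.db import connect

db = connect()
by_session = defaultdict(list)
for eg in db.get_dataset_examples("your_main_dataset_name"):
    by_session[eg.get("_session_id", "unknown")].append(eg)

for session_id, examples in sorted(by_session.items()):
    print(session_id, len(examples))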

Disclaimer out of the way, if you want to clean up these sessions you totally can:

from prodigy.components.db import connect


if __name__ == "__main__":

    db = connect()

    # This is just here to validate the main dataset exists
    dataset = db.get_dataset_by_name("your_main_dataset_name") 

    # This query might take forever to run for you given the number of sessions you have.
    # It does use Peewee query iterators though so it hopefully won't cause 
    # you to run out of memory at any point
    dataset_sessions = db.get_dataset_sessions("your_main_dataset_name")
    print("BEFORE SESSIONS LIST:", dataset_sessions)

    # dataset_sessions is a list of strs with Dataset names.
    # You first need to remove the links to all examples for this 
    # session dataset before you can delete it. 
    # If you want to clear all sessions for a dataset you can do this in a loop.
    db.unlink(dataset_sessions[0])
    
    # The session should no longer have any examples linked to it so you can safely drop it
    db.drop_dataset(dataset_sessions[0])

    # Querying again, the deleted session shouldn't be in this list anymore.
    dataset_sessions = db.get_dataset_sessions("your_main_dataset_name")
    print("AFTER SESSIONS LIST:", dataset_sessions)