Database performance/connection issues with a remote postgresql DB

I'm having issues with a remote postgresql database. The issues appear to be connection related. While I can connect and export data via python, when I run prodigy it just hangs.

I'm really not sure what's going on or how to debug this so I'm gonna dump all that I know.

Database:

  • PostgreSQL 13.10
  • AWS RDS db.t3.small, 30GB storage
  • Currently about 2.5GB of "data" in the database
  • When I export all the datasets to json files I get about 100MB of data

DB settings:

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "{{db_name}}",
      "user": "{{db_user}}",
      "password": "{{db_password}}"
    }
  }
}

# ENV
PGHOST=dbname.etc.rds.amazonaws.com

Accessing DB
I can connect and export data fine using python, e.g.

from prodigy.components.db import connect
import json
db = connect()

for dataset in db.datasets:
    data = db.get_dataset_examples(dataset)
    if data:
        with open(f"{dataset}.json", mode='w') as writer:
            json.dump(data, writer, ensure_ascii=False, indent=4)

Now if I try to run a prodigy recipe, e.g. prodigy ner.manual new_data_set blank:en some_file.txt --label X,Y,Z,etc. it hangs,

Using 16 label(s): NOUN, VERB, PRONOUN, EVENT, IWI-HAPU, LANGUAGE, LAW, LOC,
ORG, PERSON, TANGATA, TIME, WAKA, MAHITOI, AUA, PRODUCT
^X^C^X^CTraceback (most recent call last):
  File "/webapp/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
  File "cython_src/prodigy/core.pyx", line 384, in prodigy.core.recipe.recipe_decorator.recipe_proxy
  File "cython_src/prodigy/core.pyx", line 73, in prodigy.core.Controller.from_components
  File "cython_src/prodigy/core.pyx", line 164, in prodigy.core.Controller.__init__
  File "cython_src/prodigy/core.pyx", line 247, in prodigy.core.Controller.get_dataset_named_sessions
  File "/webapp/lib/python3.10/site-packages/prodigy/components/db.py", line 257, in sessions
    return [ds.name for ds in datasets]
  File "/webapp/lib/python3.10/site-packages/prodigy/components/db.py", line 257, in <listcomp>
    return [ds.name for ds in datasets]
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4543, in next
    self.cursor_wrapper.iterate()
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4463, in iterate
    result = self.process_row(row)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 7706, in process_row
    data = super(ModelObjectCursorWrapper, self).process_row(row)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 7672, in process_row
    result[attr] = converters[i](row[i])
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4707, in python_value
    return value if value is None else self.adapt(value)
  File "/webapp/lib/python3.10/site-packages/peewee.py", line 4856, in adapt
    if isinstance(value, text_type):
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/webapp/lib/python3.10/site-packages/prodigy/__main__.py", line 62, in <module>
    controller = recipe(*args, use_plac=True)
KeyboardInterrupt

I had to interrupt the command. My instance running prodigy becomes slow and eventually unresponsive. So I have to cancel the command, KeyboardInterrupt. Prodigy never actually runs. Here's what RDS reports while the prodigy command is running,

Screenshot 2023-03-16 at 12.31.31

Screenshot 2023-03-16 at 12.31.26

Screenshot 2023-03-16 at 12.31.21

Something happens! It connects, etc, but it can't get to the running stage. If I run prodigy stats I get a similar behavior.

We've never had this problem before. What's changed since "before?" More data in the database, updated to the latest prodigy version. I've since updated the postgresql version and bumped up the instance size, but that didn't resolve my issues. If I run prodigy using a local database everything runs fine.

Update
I can connect to the database and run prodigy from another machine, which leads me to suspect something's up with the ec2 instance I'm using to run prodigy. I'm using Ubuntu 20.04LTS Machine AMI (a variant published by AWS for machine learning). Having said that, it still takes a long time for prodigy to "start up."

Any help, even with further debugging, would be greatly appreciated.

There's a memory problem. I've tried running prodigy while connecting to a large database/dataset and mentioned above in a few places. In one instance, I only had 1GB of RAM and I got a Killed message. Other topics here mentioned a similar problem and attributed it to not enough RAM. How much RAM do we actually need? I'm inspecting the prodigy process running locally on my Apple M1 and this is what it reports,

That's not much memory. Then I decided to quit the prodigy process and run it again, and this time I watched the Real Memory Size grow, and decided to record it (about a minute since running the command)

At it's peak, prodigy needed almost 19GB of RAM!

What are we doing wrong here? Here's the command,

prodigy ner.manual some_dataset_name blank:en some_file.txt --label PERSON,LOC,FOOBAR

That file is only 1.3MB in size.

Is it possible we've stuffed up our database? We run multiple instances of prodigy running different experiments at different ports and people can work on those experiments with unique user sessions. Our database has about 30 different datasets. The prodigy instances ran 24/7 and a number of people labeled data. Are we meant to "merge" or "squash" data at some stage? Are there any easy automated ways to do that, e.g. squash labels when they don't conflict? Or what would be the best approach to export all labels and then start from scratch?

I've since created a new database to get up and running again. I'm running two recipes at the same time both using the same dataset and source .jsonl file which is about 8MB in size. What I observe is a huge increase in storage space on the database. I don't know much about the inner workings of DBs, but why does running two prodigy recipes do this? Could this also be related to the huge use in RAM when starting up prodigy?

Thanks for the detailed report, and sorry for the frustrations!

I'm shooting in the dark a bit here, and I'll forward the issue to someone in the team who's more across the database connectivity. But I wanted to give a quick reply in case I catch you before you're off for the night, given the time difference.

This looks suspicious: Releases · coleifer/peewee · GitHub

There was a new version of the DB library we use, Peewee, two weeks ago. Can you try installing peewee v3.15.4 and see if that helps?

I've tried peewee==3.15.4 but I'm still seeing the same issues. When trying to connect to the old database the process is Killed because it ran out of memory. The ram requirements for running prodigy must be proportional to the amount of data we have in the database but that seems like a strange requirement to me.

In the screen shots below prodigy consumes ram while trying to load until the process is killed once the machine runs out of ram.

Upon updating a read me I found a comment I wrote 8 months ago that provides more information about this problem. It sounds like I encountered this problem before when upgrading to 1.11 from 1.10. So it's possible this problem doesn't exist in 1.10.

Some notes when I tried to upgrade to 1.11. It looks like something happens with the database that causes it to run out of space. And then after that nothing runs. Maybe we need to do some sort of prodigy data base update I have no idea. I rebooted the database and it got its space back. Perhaps when we update, we just need to run a single service so the database can do its update. I have no idea. Basically don't update prodigy.

Troubling!

We haven't been able to reproduce this locally and we really want to get to the bottom of it. I hope we can arrange a call? Most of the team is in Europe (myself included) so timezone overlap to NZ is as bad as it gets. We do have someone on the team in Seattle though, so that should be better.

We'll email to try to set something up.

Hi Keoni,

I've been doing a deep dive on our database implementation since our meeting yesterday, especially looking at Postgres performance. While I haven't been able to reproduce the degree of Memory consumption you were seeing in your queries (yet at least), I did find some obvious places where memory consumption could be improved and are clearly much less performant for Postgres vs SQLite. If you're willing to help debug a little more that would be awesome!

Rather than me making fixes and publishing a new version, I thought it would be great if you could try a few small changes in your installed Prodigy code so we can have a tighter feedback loop. The changes would all be in the prodigy/components/db.py file which is one of the few Prodigy files not compiled with Cython so you're free to make changes there and should see them reflected immediately in your env.

So in your virtualenv you'll have prodigy installed in a path something like /path/to/my_project/.venv/lib/python3.9/site-packages/prodigy

First change to try

In prodigy/components/db.py

The program you shared was hanging post dataset creation when we do a call to db.count_dataset() in the Controller for each session_id associated with the current dataset.

The last 2 lines of the count_dataset method load all examples for the dataset, then check the len.

Last 2 lines of Current Code:

def count_dataset(self, name: str, session: bool = False) -> int:
    ...

    examples = query.execute()
    return len(examples)

Change last 2 lines of code to:

def count_dataset(self, name: str, session: bool = False) -> int:
    ...

    count = query.count()
    return count

I'd love to see the memory consumption output from the before and after here if you're able to provide that as well.

Also, 1 thing I didn't get from you earlier which would be very helpful is the output of calling prodigy stats for the main dataset you were using.

Specifically, I'm looking for the stats for the problem dataset so you'd need to provide the name of the dataset as the first arg to the stats command:

prodigy stats your_dataset_name

This count_dataset change is something that will go into the next Prodigy version regardless but if it fixes your startup issue by itself we can make a patch release with this change.

Thanks!

2 Likes