Slow database operations on PostgreSQL

Hi, I've noticed that the db-in, db-merge and drop operations are running slowly for me. Importing a dataset of 20,000 annotations (25 MB) takes more than 20 minutes. In contrast, db-out takes just seconds.

I'm using PostgreSQL 13.3 on an AWS r6g.large instance that doesn't receive any other queries during these operations, so it's unlikely to be a hardware constraint.

Looking at the currently running queries, it seems that each example is added/deleted individually, meaning the database has to work through 20k requests.

  • Are there any database-related settings that I have overlooked? I checked the prodigy.json docs and the psycopg2 docs.
  • Is it possible to do a bulk operation with Python rather than the CLI?

Is this still an issue? I was going through some of the old items and noticed we never responded to this ticket, sorry about that!

You can always use the Python API directly to save batches of data. The code would look something like this:

from prodigy.components.db import connect

examples = [{"text": "hello world", "_task_hash": 123, "_input_hash": 456}]

db = connect()                               # uses settings from prodigy.json
db.add_dataset("test_dataset")               # add dataset
assert "test_dataset" in db                  # check that dataset was added
db.add_examples(examples, ["test_dataset"])  # add examples to dataset
dataset = db.get_dataset("test_dataset")     # retrieve a dataset
assert len(dataset) == 1                     # check that examples were added

You can call db.add_examples multiple times to upload the data in batches of, say, 2,000 examples at a time, as in the sketch below.
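Just to illustrate, a minimal batching sketch might look like the following. It assumes examples is a list of task dicts you've already loaded, and that "test_dataset" is the dataset you want to write to; the batch size of 2,000 is only an example value.

from prodigy.components.db import connect

# Sketch: split a large list of task dicts into chunks and add each chunk
# in a separate call, so you control how much data is sent per request.
examples = []          # replace with your loaded annotation dicts
batch_size = 2000      # illustrative chunk size, tune as needed

db = connect()                                    # uses settings from prodigy.json
if "test_dataset" not in db:
    db.add_dataset("test_dataset")                # create the dataset if needed
for start in range(0, len(examples), batch_size):
    batch = examples[start:start + batch_size]
    db.add_examples(batch, ["test_dataset"])      # add one chunk at a time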

I would imagine that this can help speed things up because it allows you to control the batch size, but I'm a bit surprised that PostgreSQL would have issues with this in the first place. You mention 25 MB and only 20K examples, which isn't huge. Let me know if the issue persists, though, and I'll gladly dive in deeper.

It's also possible that a more recent version of Prodigy fixes this: there have been updates to the ORM that might have improved this experience.