db-out terminated prematurely

When running db-out I see the following:

prodigy db-out messages_05_13_22_split_51k-0 /tmp
19:45:03: INIT: Setting all logging levels to 20
email-validator not installed, email fields will be treated as str.
To install, run: pip install email-validator
19:45:03: RECIPE: Calling recipe 'db-out'
19:45:03: DB: Initializing database PostgreSQL
19:45:03: DB: Connecting to database PostgreSQL
19:45:05: DB: Loading dataset 'messages_05_13_22_split_51k-0' (20077 examples)
Killed

I have seen a couple of other posts about similar issues, although those seem to have occurred during RAM-intensive tasks or with RAM-intensive datasets (e.g. model training, image datasets). In my case I am simply trying to export ~20k text annotations. I wouldn't expect this to be particularly memory-intensive: the JSON should be on the order of 50-60MB. The machine I am running on is largely locked down for me, so there isn't much I can do in the way of investigating or profiling.

A few questions:

  • Are there any other known issues with db-out which could explain this?
  • If it is simply an OOM failure, are there solutions for exporting annotations in memory-constrained environments? The obvious move is to trim anything not needed from my datasets, but they are fairly lean as it stands and (as mentioned above) I'm working with relatively small text data. Could I export in batches?

That's interesting. I agree that ~20K text annotations should be fine. Just to double-check, could you share your Prodigy and Python versions?

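We might be able to understand more about what's happening by pulling the data from Python. The Killed output could mean multiple things (on Linux it usually means the process received SIGKILL, which is often the kernel's out-of-memory killer), but the db-out command uses this code internally: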

from prodigy.components.db import connect

# You probably need to make sure that you've got credentials set up
# https://prodi.gy/docs/api-database#setup-postgresql
# because `connect()` uses settings from prodigy.json
DB = connect()

dataset_name = "messages_05_13_22_split_51k-0"
# This should give a list of dictionaries
examples = DB.get_dataset_examples(dataset_name)

Does this also cause an error? If so, can you confirm that the error is happening in the Python process and not in the Postgres database?
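If the load succeeds on a machine with more memory, it can help to know how much Python-side memory it actually needs. The following is a minimal sketch using the standard library's tracemalloc. Here measure_peak_memory and fake_load are hypothetical helpers, not part of Prodigy, and fake_load only stands in for the DB.get_dataset_examples(dataset_name) call above. tracemalloc only sees allocations made through Python's allocator, so buffers held by a database driver's C library may not be counted.

```python
import json
import tracemalloc


def measure_peak_memory(load_fn):
    """Run load_fn() and return (result, peak_bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = load_fn()
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fake_load():
    # Hypothetical stand-in for: DB.get_dataset_examples(dataset_name)
    # Round-trips through JSON to mimic parsing stored examples.
    return [
        json.loads(json.dumps({"text": "x" * 1000, "id": i}))
        for i in range(1000)
    ]


if __name__ == "__main__":
    examples, peak = measure_peak_memory(fake_load)
    print(f"{len(examples)} examples, peak {peak / 1e6:.1f} MB")
```

To measure the real load, pass lambda: DB.get_dataset_examples(dataset_name) instead of fake_load. Parsed JSON generally takes up several times its on-disk size once it is held as Python dictionaries, so a peak well above the size of the exported file would not be surprising.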

I also couldn't help but notice that there's a warning message about an email validator. Is that related to a custom plugin?

Python version = 3.8.13
Prodigy version = 1.10.8

Circling back to say this was indeed a RAM issue - a larger allocation of RAM resolved it. So while this is technically resolved, I am still concerned about what appears to be a large amount of memory overhead for db-out. As mentioned before, the dataset to be exported is not large (56MB) and I had ~800MB of available RAM (before making the larger RAM allocation mentioned above). Is db-out known to be a memory-hungry process?

Thanks for your update!

db-out is mostly just doing two things: pulling examples through get_examples, then exporting them to .jsonl through srsly.write_jsonl. You can use these functions to write your own function that is a bit more efficient. I suspect the get_examples part is causing most of the problem.
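As a rough illustration of what a custom export could look like, here is a minimal sketch that writes examples to .jsonl a batch at a time instead of serializing everything at once. It uses the standard library's json module rather than srsly so that it runs on its own. The names iter_batches and export_jsonl_in_batches are hypothetical, not Prodigy functions. It only reduces memory used on the writing side: if the loading call returns a full list, that list is still held in memory, so the saving is largest when the examples come from a lazy source such as a generator.

```python
import json
from itertools import islice


def iter_batches(items, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def export_jsonl_in_batches(examples, path, batch_size=1000):
    """Write an iterable of dicts to `path` as JSONL, one batch at a time.

    Returns the number of examples written.
    """
    count = 0
    with open(path, "w", encoding="utf8") as f:
        for batch in iter_batches(examples, batch_size):
            # Only one batch of serialized lines is held in memory at a time
            f.write("".join(json.dumps(eg, ensure_ascii=False) + "\n" for eg in batch))
            count += len(batch)
    return count
```

Assuming the DB object from the earlier snippet, a call might look like export_jsonl_in_batches(DB.get_dataset_examples(dataset_name), "/tmp/export.jsonl"), where the output path is only an example.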

Also, your log shows PostgreSQL rather than the default database (SQLite). I wonder whether switching to a different database would help or hurt.

I haven't had much time to debug this functionality, but if you come across any useful suggestions or best practices, we'd be grateful to hear about them!