Input dataset - memory usage

Hi,

I was wondering if there is a way to load the input dataset from a database in batches. We're running multiple instances of Prodigy on Kubernetes with quite a large dataset. Analogous to this standard recipe, for example, we load the dataset with DB.get_dataset(dataset). Judging by the memory usage, this loads the full dataset into memory. Preferably we'd load the dataset in batches or splits, to keep our pods/containers ephemeral and disposable.

Would that be possible, or would that require a manual split of the data?

Thanks!
Vincent

Hi! The Database.get_dataset method will load the entire annotated dataset into memory, which makes sense for most use cases – but I definitely see the point in your case.

Under the hood, we're just querying the database using peewee, selecting the examples linked to the given dataset and returning them loaded. You can check out how it's done in prodigy/db.py – the most minimal standalone version is this:

from prodigy.components.db import Dataset, Example, Link, connect

db = connect()  # needs to run first to initialize DB proxy

def get_dataset(name: str):
    dataset = Dataset.get(Dataset.name == name)
    # select all examples linked to the given dataset via the Link table
    query = (Example.select().join(Link).join(Dataset).where(Dataset.id == dataset.id))
    examples = query.execute()
    # load the stored content of each example
    return [eg.load() for eg in examples]

So instead of calling eg.load() for every example upfront, you could return a generator and consume it in batches – or whatever else works best for your use case :slightly_smiling_face: If you do come up with a solution, I'd definitely be interested in what works best. Maybe we can just integrate something like it natively!
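Here's a rough sketch of what a batched version could look like, assuming the same models as above – the get_dataset_in_batches name and the batch_size argument are just illustrative, and it relies on peewee's paginate() to pull one page of rows at a time:

from prodigy.components.db import Dataset, Example, Link, connect

db = connect()  # needs to run first to initialize DB proxy

def get_dataset_in_batches(name: str, batch_size: int = 1000):
    # hypothetical helper: yields lists of loaded examples, batch_size at a time
    dataset = Dataset.get(Dataset.name == name)
    query = (Example.select().join(Link).join(Dataset).where(Dataset.id == dataset.id))
    page = 1
    while True:
        # paginate() only fetches the rows for the given page (pages are 1-based)
        rows = list(query.paginate(page, batch_size))
        if not rows:
            break
        yield [eg.load() for eg in rows]
        page += 1

for batch in get_dataset_in_batches("my_dataset", batch_size=500):
    ...  # process one batch at a time, so only that batch is held in memory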

Thanks @ines, I'll give it a go!