Input dataset - memory usage

Hi,

I was wondering if there is a way to load the input dataset from a database in batches. We're running multiple instances of Prodigy on Kubernetes with quite a large dataset. Analogous to this standard recipe, for example, we load the dataset with DB.get_dataset(dataset). Judging by the memory usage, this loads the full dataset into memory. Preferably we'd load the dataset in batches or splits, to keep our pods/containers ephemeral and disposable.

Would that be possible, or would that require a manual split of the data?

Thanks!
Vincent

Hi! The Database.get_dataset method will load the entire annotated dataset into memory, which makes sense for most use cases – but I definitely see the point in your case.

Under the hood, we're just querying the database using peewee, selecting the examples linked to the given dataset and returning them loaded. You can check out how it's done in prodigy/db.py – the most minimal standalone version is this:

from prodigy.components.db import Dataset, Example, Link, connect

db = connect()  # needs to run first to initialize DB proxy

def get_dataset(name: str):
    dataset = Dataset.get(Dataset.name == name)
    # select all examples linked to the given dataset via the Link table
    query = (Example.select().join(Link).join(Dataset).where(Dataset.id == dataset.id))
    examples = query.execute()
    # load the stored content of each example
    return [eg.load() for eg in examples]

So instead of calling eg.load() for every example upfront, you could return a generator and consume it in batches – or whatever else works best for your use case :slightly_smiling_face: If you do come up with a solution, I'd definitely be interested in what works best. Maybe we can just integrate something like it natively!
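Here's a rough sketch of what a batched version could look like, assuming the same models as above – the get_dataset_in_batches name and the batch_size argument are just illustrative, and it relies on peewee's paginate() to pull one page of rows at a time:

from prodigy.components.db import Dataset, Example, Link, connect

db = connect()  # needs to run first to initialize DB proxy

def get_dataset_in_batches(name: str, batch_size: int = 1000):
    # hypothetical helper: yields lists of loaded examples, batch_size at a time
    dataset = Dataset.get(Dataset.name == name)
    query = (Example.select().join(Link).join(Dataset).where(Dataset.id == dataset.id))
    page = 1
    while True:
        # paginate() only fetches the rows for the given page (pages are 1-based)
        rows = list(query.paginate(page, batch_size))
        if not rows:
            break
        yield [eg.load() for eg in rows]
        page += 1

for batch in get_dataset_in_batches("my_dataset", batch_size=500):
    ...  # process one batch at a time, so only that batch is held in memory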

Thanks @ines, I'll give it a go!