Hi,
I was wondering if there is a way to load the input dataset from a database in batches. We're running multiple instances of Prodigy on Kubernetes with quite a large dataset. Analogous to, for example, this standard recipe, we load the dataset with DB.get_dataset(dataset). Judging by the memory usage, this loads the full dataset into memory. We'd prefer to load the dataset in batches or splits, to keep our pods/containers ephemeral and disposable.
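For reference, this is roughly what we do now (the dataset name is just a placeholder):

```python
from prodigy.components.db import connect

db = connect()  # connects using the settings from prodigy.json
examples = db.get_dataset("my_dataset")  # returns every example as one list
```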
Would that be possible, or would that require a manual split of the data?
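If a manual split is the only option, we imagine something like the sketch below, which streams examples straight from the underlying tables. This is just an illustration, not Prodigy API: it assumes the default SQLite backend and Prodigy's dataset/link/example schema, and the table/column names, the iter_dataset_batches helper, and the file/dataset names are our guesses.

```python
import json
import sqlite3

def iter_dataset_batches(db_path, dataset_name, batch_size=500):
    """Yield lists of examples from a Prodigy SQLite database, batch by
    batch, so the full dataset never has to sit in memory at once."""
    conn = sqlite3.connect(db_path)
    try:
        # Table/column names assume Prodigy's dataset -> link -> example
        # schema; verify them against your own database first.
        cursor = conn.execute(
            """
            SELECT example.content
            FROM example
            JOIN link ON link.example_id = example.id
            JOIN dataset ON dataset.id = link.dataset_id
            WHERE dataset.name = ?
            """,
            (dataset_name,),
        )
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            # content holds the JSON-encoded task; decode each row
            yield [json.loads(content) for (content,) in rows]
    finally:
        conn.close()

# Hypothetical usage: handle one batch at a time instead of the full dataset
for batch in iter_dataset_batches("prodigy.db", "my_dataset"):
    ...  # hand the batch to the recipe / stream
```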
Thanks!
Vincent