I imported a dataset with 4M annotations, stored in an AWS database instance and accessed from an AWS EC2 instance with 16GB of memory. Other, smaller datasets work fine, but even simple operations on this dataset tend to do bad things to the machine, even when nothing else is running: the process runs for many hours without completing, the terminal stops responding to Ctrl-C, and a second shell cannot establish an SSH connection.
Is it conceivable that prodigy stats the-big-data-set could run the instance out of memory? Or that the db connection times out? Is it worth probing deeper with the Python database API? I'd be happy to delete the dataset, but prodigy drop the-big-data-set shows the same problem.
Having insight into these possible practical limitations would be valuable. Can people share the size of their largest datasets, please?
Version 1.9.6
Location /usr/local/lib/python3.8/site-packages/prodigy/recipes
Prodigy Home /prodigy
Platform Linux-4.4.0-1072-aws-x86_64-with-glibc2.2.5
Python Version 3.8.1
Database Name MySQL
Database Id mysql
Total Datasets 15
Total Sessions 51
Hi! By "imported a dataset", do you mean existing annotations? Or do you mean the input data file?
To collect annotations, you shouldn't have to import anything upfront – the database should really only hold the collected annotations. The input streams are generators by default and only process one batch at a time, so if you can stream your input data, there shouldn't be a problem with very large or even potentially infinite streams.
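For illustration, here's a minimal sketch of that streaming model in plain Python, assuming a JSONL file on disk (the file name and batch size are made up):

```python
import json
from itertools import islice

def stream_examples(path):
    # Lazily yield one task dict per line; nothing beyond the current
    # line is held in memory, so the file can be arbitrarily large.
    with open(path, encoding="utf8") as f:
        for line in f:
            yield json.loads(line)

def batches(stream, size=10):
    # Hand out small batches, the way a generator-based stream is consumed.
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return
        yield batch

# Usage: iterate over a huge file without ever loading it all.
# for batch in batches(stream_examples("the-big-data-set.jsonl")):
#     ...
```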
If you have existing annotations, importing them to Prodigy really only makes sense if you want to use Prodigy for training or if you want to automatically exclude examples that have already been annotated. If you're dealing with such large datasets, training via Prodigy probably doesn't make much sense – you typically want to train with spaCy directly, using the CLI and possibly on a GPU. And if what you care about is excluding annotations, all you need is the hashes.
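For example, here's a rough sketch of hash-based exclusion, assuming the set_hashes helper and the database's get_task_hashes method (the dataset name is just a placeholder):

```python
from prodigy import set_hashes
from prodigy.components.db import connect

db = connect()  # uses your prodigy.json / env DB settings
# Task hashes of everything already annotated in the existing dataset.
seen = set(db.get_task_hashes("the-big-data-set"))

def filter_seen(stream):
    # Skip incoming tasks whose task hash is already in the dataset.
    for eg in stream:
        eg = set_hashes(eg)
        if eg["_task_hash"] not in seen:
            yield eg
```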
Even if you're using Prodigy to collect millions of annotations (which shouldn't be a problem at all), you probably wouldn't want to be adding them all to the same single dataset. Datasets in Prodigy let you group annotations together and are intended to hold "single units of work that belong together". Prodigy does assume that it's generally no problem to load a single dataset into memory.
I'd say that's possible, yes – after all, that'd load the whole dataset into memory, parse all JSON records in it and then compute additional stats. You can probably run some profiling to see exactly where the problem happens. Under the hood, Prodigy currently uses peewee to manage the database. I don't think using Prodigy's Python API will make a difference – the commands pretty much call into that directly.
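For example, a quick sketch to check whether simply loading the dataset is what blows up memory, using the standard library's tracemalloc (this assumes db.get_dataset loads all examples of a set, as in v1.x):

```python
import tracemalloc
from prodigy.components.db import connect

tracemalloc.start()
db = connect()
# Loading the full dataset is roughly what `prodigy stats <dataset>` has to do.
examples = db.get_dataset("the-big-data-set")
current, peak = tracemalloc.get_traced_memory()
print(f"{len(examples)} examples, peak memory ~{peak / 1024 ** 2:.1f} MB")
```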
Have you tried using the drop command with a --batch-size? This was added to prevent timeouts for large datasets, so maybe that'll solve the problem for you. If not, you can always delete the records directly in your MySQL database, from the Dataset, Link and Example tables. Make sure to delete both the examples and their links (which specify the datasets they're part of, so examples can appear in multiple sets).
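As a rough sketch of the direct-SQL route (the connection details are placeholders and the lowercase table/column names are assumptions based on the default schema – verify with SHOW TABLES / DESCRIBE before running anything destructive):

```python
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details.
conn = mysql.connector.connect(
    host="...", user="...", password="...", database="prodigy"
)
cur = conn.cursor()

cur.execute("SELECT id FROM dataset WHERE name = %s", ("the-big-data-set",))
(dataset_id,) = cur.fetchone()

# Delete the links for this dataset first, then any examples that are no
# longer linked to *any* dataset, then the dataset row itself. Examples
# shared with other datasets keep their links and are left alone.
cur.execute("DELETE FROM link WHERE dataset_id = %s", (dataset_id,))
cur.execute(
    "DELETE FROM example WHERE id NOT IN (SELECT DISTINCT example_id FROM link)"
)
cur.execute("DELETE FROM dataset WHERE id = %s", (dataset_id,))
conn.commit()
```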
I did import previously annotated data, thinking it was simpler to keep all of a project's data (text, annotations, and source metadata) in the same MySQL database, even if exporting to spaCy JSON later becomes necessary when operating at scale. But after more reading, spaCy's DocBin plus custom Doc attributes in versioned S3 buckets is the way forward.
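For reference, the DocBin route looks roughly like this (a sketch; the custom attribute name, the example metadata and the S3 step are assumptions):

```python
import spacy
from spacy.tokens import Doc, DocBin

# Register a custom attribute to carry source metadata alongside each Doc.
Doc.set_extension("source", default=None)

nlp = spacy.blank("en")
doc_bin = DocBin(store_user_data=True)  # keep user_data (custom attrs) on serialization

for text, meta in [("First annotated text.", {"file": "a.txt"})]:
    doc = nlp(text)
    doc._.source = meta
    doc_bin.add(doc)

# Serialize to bytes, then upload to a versioned S3 bucket (e.g. with boto3).
data = doc_bin.to_bytes()
```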
It's good to know that Prodigy generally assumes datasets fit in memory, thanks.
Unfortunately prodigy drop the-big-data-set -n 1000 has been running for half an hour, with memory growing by about 1-2MB per second, so it doesn't seem usable for me. I'll plan to delete the records directly in MySQL.