Unable to delete dataset

Hi, I am trying to delete a dataset (and its associated session) so I can create a new one. The dataset has 36269 annotations and I get this error:

$ prodigy drop -n 500 uc-specific-reviewed
✘ Unable to delete dataset 'uc-specific-reviewed' because a database
limit on the number of query variables was reached. On some systems this limit
is quite low, so using a custom batch size may resolve the issue. Try: prodigy
drop -n 500 uc-specific-reviewed.

As you can see, I had already run the exact command the error message suggests. What should I do?

Welcome to the forum @sunnielou :wave:

I understand the same error message appears when you just try prodigy drop uc-specific-reviewed.

Could you try with progressively lower batch sizes starting with 100?

Thanks @magdaaniol!

Yes, the same error appears without specifying the batch size. I tried with 100, 50, and 10 and I got the same error message:

prodigy drop -n 10 uc-specific-reviewed
✘ Unable to delete dataset 'uc-specific-reviewed' because a database
limit on the number of query variables was reached. On some systems this limit
is quite low, so using a custom batch size may resolve the issue. Try: prodigy
drop -n 500 uc-specific-reviewed

I see. Thanks for trying out different batch sizes.

Could you also provide the following details:

  1. The output of the prodigy stats command.
  2. The version of peewee you're using (pip freeze | grep peewee).
  3. If you're using SQLite, the version of sqlite3 (sqlite3 --version).
  4. If you're using SQLite, could you check what SQLITE_MAX_VARIABLE_NUMBER is set to on your system?

Additionally, are you using the built-in Prodigy DB or a custom DB? Lastly, what kind of annotations are you working with (text, audio, images, video)? If you're working with audio, images, or video, are you storing base64-encoded data in the DB?
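In case it's easier than hunting down the right binaries, here's one way to check points 3 and 4 from within the same Python environment Prodigy runs in. This is just a sketch: note that MAX_VARIABLE_NUMBER only shows up in the compile options if it was set explicitly when that SQLite build was compiled, so its absence from the list is not conclusive.

```python
import sqlite3

# The SQLite library version this Python is linked against (this can differ
# from whatever `sqlite3` binary is on your PATH).
print(sqlite3.sqlite_version)

# List the compile-time options; MAX_VARIABLE_NUMBER appears here only if it
# was set explicitly when this SQLite build was compiled.
con = sqlite3.connect(":memory:")
options = [opt for (opt,) in con.execute("PRAGMA compile_options;")]
con.close()
print("\n".join(options))
```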

I am sorry for the delayed response. Here are the outputs:

prodigy stats
Version 1.15.7
Platform macOS-14.6.1-arm64-arm-64bit
Python Version 3.9.6
spaCy Version 3.7.6
Database Name SQLite
Database Id sqlite
Total Datasets 4
Total Sessions 3

The peewee version: peewee==3.16.3
SQLite version: 3.43.2
The SQLITE_MAX_VARIABLE_NUMBER is supposed to be 32766 since I have a version newer than 3.32.
I am also using the built-in Prodigy DB and working with text annotations.

Thanks for the outputs.

There's clearly some sort of build or versioning issue that I'm not understanding, resulting in a lower-than-expected SQLITE_MAX_VARIABLE_NUMBER. I'm sure there's a way to query that, so that's one avenue to pursue.

But stepping back a bit, I think the DB.drop_dataset() function can be reimplemented to avoid this problem altogether, using smarter queries. I've drafted a PR trying this out.

Could you try the following for me?

from prodigy.components.db import connect

DB = connect()
sessions = DB.get_dataset_sessions("my-dataset")  # replace with your dataset name
print(sessions)

The get_dataset_sessions() method also includes some query logic that could be causing queries with a large number of objects, so I want to check whether we need to fix that one too.

Here's an overview of the problem and the patch that will hopefully fix it.

The DB has a many-to-many mapping between the Example model and the Dataset model, via the Link model. The current drop_dataset() implementation does a lot of work in Python to find the set of example IDs that aren't part of any other datasets or sessions, and then deletes the examples that have no other references.

What I've done instead is first check whether we have any 'orphan' examples in the database, and if we do, raise an error. We shouldn't have orphan examples, but if we do, the user can either delete them with the new DB.delete_orphan_examples() method, or retrieve them with the new DB.get_orphan_example_ids() method and link them to a new dataset.

If we know we don't have any orphan examples, we just need to delete all links pointing to the datasets that will be deleted, and then follow that up by deleting all orphan examples. This way we don't have any queries that have to enumerate a set of link IDs or example IDs to delete, so we shouldn't run into this problem.
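To make the deletion order concrete, here's a standalone sketch against a toy schema using plain sqlite3 (illustrative only, not Prodigy's actual models or code): delete the dataset's links first, then sweep any examples left with no links at all, so no query ever has to enumerate thousands of IDs.

```python
import sqlite3

# Toy schema mirroring the Example <-> Link <-> Dataset shape described above.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE example (id INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE link (example_id INTEGER, dataset_id INTEGER);
    INSERT INTO dataset VALUES (1, 'to-drop'), (2, 'other');
    INSERT INTO example VALUES (10, 'shared'), (11, 'only-in-dropped');
    INSERT INTO link VALUES (10, 1), (10, 2), (11, 1);
""")

def drop_dataset(con, dataset_id):
    # 1. Remove every link pointing at the dataset being dropped.
    con.execute("DELETE FROM link WHERE dataset_id = ?", (dataset_id,))
    # 2. Sweep examples that now have no links ("orphans"). No list of
    #    example IDs is ever bound into the query, so it uses a constant
    #    number of variables regardless of dataset size.
    con.execute(
        "DELETE FROM example WHERE id NOT IN (SELECT example_id FROM link)"
    )
    # 3. Finally remove the dataset row itself.
    con.execute("DELETE FROM dataset WHERE id = ?", (dataset_id,))

drop_dataset(con, 1)
remaining = [row[0] for row in con.execute("SELECT content FROM example")]
print(remaining)  # the shared example survives; the unshared one is gone
```

The key point is that every statement binds at most one variable, which sidesteps any SQLITE_MAX_VARIABLE_NUMBER limit entirely.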

Hi Matthew,

I got this error for the get_dataset_sessions() method:

OperationalError: too many SQL variables

Regarding the SQLITE_MAX_VARIABLE_NUMBER, can you help me extract this number? I have tried a couple of things, but they did not work, since I do not have a standalone database set up on my local machine; I have just been using Prodigy locally.

Hi @sunnielou,

Do you have access to the machine where the database lives? If so, you could query it from the sqlite3 CLI:

PRAGMA compile_options;

This should contain MAX_VARIABLE_NUMBER.

I also wanted to let you know that we have recently shipped Prodigy 1.16.0, which includes @honnibal's reimplementation of the drop logic - perhaps you could try deleting your dataset with Prodigy 1.16.0?

Hi @magdaaniol,

I got this: MAX_VARIABLE_NUMBER=500000

I tried installing Prodigy 1.16.0 with the python -m pip install --upgrade prodigy command, but I only got Prodigy 1.15.8. Should I try python -m pip install --pre prodigy?

Hi @sunnielou ,

Could you access the download URL https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy in the browser and check if 1.16.0 is on the list there?
Thanks!

Hi @magdaaniol,

I've checked and 1.16.0 is not on the list.

Thanks @sunnielou. We wanted to make sure you have access to the fix you helped us identify.
The email you use on the forum is not mapped to any order in our database.
Could you please email us back with your license key or the order number (or the email you used to make the purchase) at contact@explosion.ai?
Thank you.

I've sent the email now. Thank you for your help. :smiley: