Error when 'prodigy drop'

I get this error when i try to ‘drop’ the db.
“prodigy drop equinor_split_20p”

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3830, in execute_sql
    cursor.execute(sql, params or ())
sqlite3.OperationalError: too many SQL variables

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prodigy/__main__.py", line 251, in <module>
    plac.call(commands[command], arglist=args, eager=False)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prodigy/__main__.py", line 148, in drop
    dropped = DB.drop_dataset(set_id)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/prodigy/components/db.py", line 354, in drop_dataset
    if len(link_examples) == 0:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3298, in __len__
    return len(self.execute())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3274, in execute
    self._qr = ResultWrapper(model_class, self._execute(), query_meta)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 2939, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3837, in execute_sql
    self.commit()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3656, in __exit__
    reraise(new_type, new_type(*exc_args), traceback)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 135, in reraise
    raise value.with_traceback(tb)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/peewee.py", line 3830, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables

When i say “prodigy stats -l” i can still see the name. if i copy-paste the same name in ‘drop’ to delete it, i get this error.

I’m seeing this same error, did you figure out what the problem was?

No… I have noticed that it happens in mac. Does not occur in Unix systems. But i didn’t find a solution yet.

Ah, yes I'm on a mac too.

It looks like it's related to the SQLITE_MAX_VARIABLE_NUMBER limit, which on my machine is set to 999 -- this is around the number of training examples where dropping a dataset gives an error. It's discussed a bit in this answer:

but I don't have a fix for it.

Hmm… this could be a bug. it works for some datasets and does not work for some. It doesn’t break for all of it though. May be i should check for the size of datasets.

Ok, This seems to be a bug. happening from db.py of the prodigy/components.
In line 354, it is checking len(link_examples) for zero and non-zero for the contents. but the parameter ‘len’ does not seem to work for that class <class ‘prodigy.components.db.Link’>. (that class “Link” does not have an overriding method for ‘len’ too) It might work if that is a list. I have not dug into why it works on unix though. maybe it is available in a different python version?
@ines @honnibal - is this really a bug?

Thanks so much for the detailed analysis!

If this indeed comes down to the SQLITE_MAX_VARIABLE_NUMBER, we could probably have a workaround for this. From the StackOverflow thread, it sounds like simply doing it in several batches is a solution? So in drop_examples and/or drop_examples in the db.py, you could try splitting the items to delete in batches of < 999 and then make several passes over them to delete them (instead of doing it all in one).

yes, I believe that would fix the errors on mac, presumably the default is set (much?) higher on unix. I’m going to try to get some time next week to test and make sure it fixes the issue.

Okay, turns out batching up the records definitely seems to fix this. So we’ll have a patch for this in the upcoming version of Prodigy :slightly_smiling_face:

Hi,

I am getting the same error in a windows machine with prodigy version 1.6.1, which was not present in version 1.5.1

Traceback (most recent call last):
  File "C:\Users\sa064014\AppData\Local\Continuum\anaconda3\lib\site-packages\peewee.py", line 3830, in execute_sql
    cursor.execute(sql, params or ())
sqlite3.OperationalError: too many SQL variables

I am currently using another python environment with Prodigy 1.5.1 to drop datrasets

thank you in advance
kind regards
claudio

Hello Everyone!

We can replace the drop_dataset function in db.py located at

…/prodigy/components/db.py

PS: Ofcourse, as always it is good to have a database backup prior to testing

    def drop_dataset(self, name):
        """
        name (unicode): The name of the dataset to drop.
        RETURNS (bool): True if dataset was dropped.
        """
        with self.db.atomic():
            log("DB: Dropping dataset '{}'".format(name))
            dataset = Dataset.get(Dataset.name == name)
            links = Link.select().where(Link.dataset_id == dataset.id)
            found_datasets = [dataset.id]
			
            # Grab all the example ids to be deleted
            original_example_ids = Link.select(Link.example_id).where(Link.dataset_id == dataset.id)

            # Grab the list of link ids to be deleted (same dataset and linked datasets sharing the same examples)
            link_others = Link.select(Link.id).where(Link.example_id << original_example_ids).select(Link.id).where(Link.dataset_id != dataset.id)
            link_self = Link.select(Link.id).where(Link.example_id << original_example_ids).select(Link.id).where(Link.dataset_id == dataset.id)

            # Grab the list of datasets ids sharing the examples
            for ds in Link.select().where(Link.example_id << original_example_ids).select(Link.dataset_id).distinct().where(Link.dataset_id != dataset.id):
                found_datasets.append(ds.dataset_id)

            # Delete all the matching examples
            Example.delete().where(Example.id << original_example_ids).execute()
            
            # Delete all the matching links from other datasets
            Link.delete().where(Link.id << link_others).execute()

            # Delete all the matching links from selected dataset
            Link.delete().where(Link.id << link_self).execute()
            
            # Delete the datasets without any links
            for ds in found_datasets:
                if Link.select(Link.id).where(Link.dataset_id == ds).count() == 0 :
                    Dataset.delete().where(Dataset.id == ds).execute()
            
            self.db.commit()
            log("DB: Removed dataset '{}'".format(name))
        return True

Note:

The following order of deletion must be followed (unless you plan on cutting the branch you are sitting on)

  1. Example
  2. Link (other datasets sharing the same examples)
  3. Link (selected dataset)
  4. Dataset (selected and other datasets if they are empty)

We can also clean the prodigy database (with unreferenced examples/datasets), by connecting to prodigy db using SQLite studio and executing the following query

    delete from link where link.id in (SELECT id FROM link GROUP BY example_id HAVING count(*) = 1);
    delete from example where example.id not in (select link.example_id from link);
    delete from dataset where dataset.id not in (select link.dataset_id from link);
    VACUUM;

Thanks and Regards,
Swarup Selvaraj.

2 Likes