Feature Request: Bulk Dataset Drop

wpm · February 9, 2018, 3:55pm

While experimenting with Prodigy I find myself creating and dropping lots of datasets. Every once in a while I go through and clean up every dataset. This is a little tedious because prodigy drop only works on one dataset at a time.

It would be nice to have ways to make this faster. For example, prodigy drop could take multiple arguments, or maybe patterns to match the names of datasets.

andy · February 9, 2018, 4:12pm

You could use the drop function in Prodigy’s __main__.py and apply it over a list with something like this (WARNING, untested):

from prodigy.components.db import connect
DB = connect()
# could take as a plac input
to_drop = "db1,temp1,test4"
to_drop = [i.strip() for i in to_drop.split(",")]

# [Copy this function from __main__.py, line 132]
def drop(set_id):
    """
    ...
    """

for db in to_drop:
    drop(db)

ines · February 9, 2018, 4:18pm

@andy That’s a nice idea actually!

Alternatively, Prodigy’s Database also has a drop_dataset method – this is a little more direct, but won’t give you any warnings or print any results.

from prodigy.components.db import connect

db = connect()
for dataset in ['db1', 'temp1', 'test4']:
    db.drop_dataset(dataset)
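
The same loop could also cover the pattern matching idea from the original request. Here's a rough sketch using glob-style patterns from Python's fnmatch module. The drop_matching helper and its dry_run flag are made up for illustration and aren't part of Prodigy. It assumes the database object exposes a datasets list of names alongside drop_dataset, so check that against your version first.

```python
from fnmatch import fnmatchcase


def drop_matching(db, patterns, dry_run=True):
    """Drop every dataset whose name matches one of the glob patterns.

    Assumption: `db` exposes `datasets` (a list of dataset names) and
    `drop_dataset(name)`, like the object returned by connect().
    """
    # Collect the matches first, so the list we loop over is not
    # affected by the drops themselves.
    matched = [
        name for name in db.datasets
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]
    for name in matched:
        if dry_run:
            print(f"Would drop: {name}")
        else:
            db.drop_dataset(name)
            print(f"Dropped: {name}")
    return matched


# Hypothetical usage with a real database:
# from prodigy.components.db import connect
# drop_matching(connect(), ["temp*", "test?"])                 # preview only
# drop_matching(connect(), ["temp*", "test?"], dry_run=False)  # actually drop
```

Defaulting to a dry run means nothing gets deleted until you've looked at the list of matches and explicitly asked for it.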

I’ve also been thinking about building a little app that lets you view (and potentially manage) datasets in the browser. Like, a “Prodigy Dataset Explorer”. We wouldn’t necessarily ship this with the library, but it could be a nice open-source addon that users could install and contribute to.

justindujardin · February 9, 2018, 6:04pm

I've got something like that as an Electron app, if you're using sqlite. I've been using it for reviewing and updating annotations on my project, and as a scratch pad for making programmatic changes to my db. That's what the beautiful "custom" button does on the top right.

I pushed it to GitHub, in case it's useful for other people: justindujardin/prodigy-viewer, an app for reviewing and changing Prodigy annotations after an annotation session is complete.

ines · February 9, 2018, 6:38pm

@justindujardin Oh wow, this is amazing!!! It makes me so happy to see all the cool stuff you and others are building with and for Prodigy (We should probably start compiling a list of all addons, scripts and custom recipes soon. Luckily, the prodigy topic on GitHub is unoccupied, which is pretty nice!)

Btw, in terms of the database connection: Not sure how easy it would be to integrate something like this, but in theory, the app could also communicate with Prodigy’s database via a REST API. For example:

import hug
from prodigy.components.db import connect

DB = connect()

@hug.get('/dataset/{dataset_id}')
def get_dataset(dataset_id):
    examples = DB.get_dataset(dataset_id)
    return {'examples': examples}

To make this work more smoothly, we probably need a few additional database methods, though (like update_example etc).

justindujardin · February 9, 2018, 9:15pm

Yeah, that'd be much better! I've been putting all the sqlite specific stuff in an angular service, so it should be pretty easy to swap out.

The public API of the sqlite service class is probably a good reference for that kind of stuff. You may be right that it could all boil down to an update_example endpoint.

emiltj · January 21, 2023, 11:06am

Just here to agree that it would be nice to be able to drop multiple datasets with the prodigy drop CLI, without having to use Python or add-ons.

And additionally, perhaps a prodigy drop --all feature.

koaning · January 23, 2023, 10:13am

I'm a little bit wary of supporting a drop --all here, mainly because it can lead to a database losing all of its data. Unless everyone has proper backups, supporting this might cause dramatic accidents to take place.

emiltj · January 31, 2023, 2:31pm

Could be a good argument against it.

However, dropping a list of datasets would have been hugely beneficial for my latest project.

koaning · January 31, 2023, 2:34pm

I understand. But having a custom Python script seems like a sensible hurdle to make sure folks don't do it by accident.

emiltj · February 1, 2023, 4:52am

Sorry to keep dragging the thread on, but this doesn't speak against being able to bulk drop by a list, does it?

E.g. have the functionality matching:

prodigy drop dataset1,dataset2,dataset3

koaning · February 3, 2023, 3:20pm

If you really want to delete all datasets, you can also choose to manually delete the SQLite database file locally. It's not something I recommend, because there's a risk of losing data. But that would really delete it all.

Another option could also be to use bash directly. This would allow you to delete all datasets whose names are listed in a file.

Suppose that you have this file called names.txt:

name-a
name-b
name-c

Then you could run:

cat names.txt | while read line; do prodigy drop "$line"; done
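
If you'd rather not start a new Prodigy process for every dataset, a rough Python equivalent of the loop above could look like this. It's only a sketch: read_names and drop_listed are hypothetical helpers, and it assumes the database object has a datasets list of names and a drop_dataset method.

```python
from pathlib import Path


def read_names(path):
    """Return the non-empty lines of a text file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def drop_listed(db, names):
    """Drop the listed datasets and return (dropped, missing).

    Assumption: `db` exposes `datasets` (a list of dataset names) and
    `drop_dataset(name)`, like the object returned by connect().
    """
    existing = set(db.datasets)
    dropped, missing = [], []
    for name in names:
        if name in existing:
            db.drop_dataset(name)
            existing.discard(name)  # a repeated name is reported as missing
            dropped.append(name)
        else:
            missing.append(name)
    return dropped, missing


# Hypothetical usage with a real database:
# from prodigy.components.db import connect
# dropped, missing = drop_listed(connect(), read_names("names.txt"))
```

Names that aren't in the database are collected and returned instead of being passed to drop_dataset, so a typo in names.txt shows up in the result.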
