Editing datasets

Mede · October 25, 2017, 11:37am

I may have missed it in the documentation, but is there some functionality for editing datasets, i.e. renaming, copying or deleting them, or editing the comments to them?

ines · October 25, 2017, 1:05pm

There are currently the following commands to interact with the datasets in the database:

prodigy db-in: Import annotations to a dataset.
prodigy db-out: Export annotations from a dataset or session.
prodigy drop: Remove a dataset or session from the database.

To view the datasets and sessions (each individual annotation session, named after the timestamp), you can use the prodigy stats command:

prodigy stats -l    # view stats and list all datasets
prodigy stats -ls  # view stats and list all datasets and sessions

We didn’t want to add too many Prodigy-specific commands for interacting with the database at this point, because that easily gets messy and we weren’t sure how many of them users would actually need. So for now, if you want to rename an existing dataset or change its description, you’d have to export it and re-add it under the new name:

prodigy db-out my_set /tmp
prodigy db-in my_new_set /tmp/my_set.jsonl "Some description"
prodigy drop my_set  # optional: delete dataset

You can also preview an existing dataset on the command line using ner.print-dataset and textcat.print-dataset. If the dataset is large, I’d recommend piping the output to less so you can navigate through it (with the -r flag to make sure the colors are displayed correctly):

prodigy ner.print-dataset news_headlines | less -r

Mede · October 25, 2017, 1:19pm

OK, great. This should be all I need for now. Thanks!

koaning · March 19, 2018, 3:34pm

It almost feels like this should be part of the docs. Is there a place that lists everything you can do with the command line?

ines · March 19, 2018, 3:58pm

The PRODIGY_README.html which is available for download with Prodigy has all those commands grouped into "Other recipes and commands" in the "Recipes" section of the table of contents. There's also an overview online on the recipes page – the examples focus more on the annotation recipes, but the database commands are also listed towards the end.

The thing is, there's not always a perfectly clear distinction between "commands" and "recipes" – technically, all Prodigy commands are also recipes, since they're wrapped in the @recipe decorator and use the same CLI style. You can also easily add your own Prodigy commands via custom recipes. (If a recipe doesn't return a dictionary of components, Prodigy will just execute the function and not start the server – so you could easily write your own database commands if there's extra functionality you need.)
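As a rough illustration of such a custom database command, here is a minimal sketch of a dataset-copy helper. It assumes the database object exposes `datasets`, `get_dataset`, `add_dataset` and `add_examples`, as Prodigy's database API did around the time of this thread; newer versions may differ, so check the API docs for your version. The name `copy_dataset` and the recipe name `db-copy` are made up for this example.

```python
def copy_dataset(db, source, target):
    """Copy all examples from dataset `source` into a new dataset `target`.

    `db` is assumed to behave like the object returned by Prodigy's
    connect(): it needs .datasets, .get_dataset(), .add_dataset() and
    .add_examples(). Returns the number of examples copied.
    """
    if target in db.datasets:
        raise ValueError(f"Dataset '{target}' already exists")
    examples = db.get_dataset(source)
    if examples is None:
        raise ValueError(f"Can't find dataset '{source}'")
    db.add_dataset(target)
    db.add_examples(examples, datasets=[target])
    return len(examples)


# Hypothetical wiring as a custom recipe (requires Prodigy, so it is left
# commented out here). The function returns no dictionary of components,
# so Prodigy would just execute it and not start the server.
#
# import prodigy
# from prodigy.components.db import connect
#
# @prodigy.recipe(
#     "db-copy",
#     source=("Dataset to copy from", "positional", None, str),
#     target=("Name of the new dataset", "positional", None, str),
# )
# def db_copy(source, target):
#     count = copy_dataset(connect(), source, target)
#     print(f"Copied {count} examples from '{source}' to '{target}'")
```

If the recipe were saved to a file such as recipe.py (a placeholder name), it could be run with something like: prodigy db-copy my_set my_new_set -F recipe.py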

yw2903 · June 1, 2021, 6:53pm

Is there a way to drop certain annotations from a database without dropping the entire dataset?

ines · June 2, 2021, 3:00am

How are you identifying the examples? Do you have their hashes, or are you using some custom logic for filtering?

There's no direct command for this because specifying task IDs on the CLI typically isn't that useful. But you could do this by filtering the examples to remove the ones you want to drop, saving them to a new dataset, and dropping the old one. (If you want to use the same name for both datasets, don't forget to back up your previous annotations – otherwise, a small bug in your code between dropping the old set and re-adding the examples can cause you to lose the data.)
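If you do have the hashes, the filtering step could look roughly like the sketch below. It works on a JSONL file exported with db-out and assumes each example carries the `_task_hash` value that Prodigy assigns; the function name `filter_jsonl` is made up for this example, and you could swap the hash check for any custom logic.

```python
import json


def filter_jsonl(in_path, out_path, drop_hashes):
    """Copy a JSONL export to out_path, skipping unwanted examples.

    An example is skipped if its "_task_hash" is in drop_hashes.
    Returns the number of examples that were kept.
    """
    drop = set(drop_hashes)
    kept = 0
    with open(in_path, encoding="utf8") as src, \
            open(out_path, "w", encoding="utf8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            eg = json.loads(line)
            if eg.get("_task_hash") in drop:
                continue  # this is one of the examples to drop
            dst.write(json.dumps(eg) + "\n")
            kept += 1
    return kept
```

The workflow would then be: export with prodigy db-out my_set /tmp, run filter_jsonl on /tmp/my_set.jsonl to write a filtered file, and import that file into a new dataset with prodigy db-in. The exported file doubles as the backup, so keep it around until you've checked the new dataset.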
