Editing datasets

I may have missed it in the documentation, but is there some functionality for editing datasets, i.e. renaming, copying or deleting them, or editing the comments to them?


There are currently the following commands to interact with the datasets in the database:

  • prodigy db-in: Import annotations to a dataset.
  • prodigy db-out: Export annotations from a dataset or session.
  • prodigy drop: Remove a dataset or session from the database.

To view the datasets and sessions (each individual annotation session, named after the timestamp), you can use the prodigy stats command:

prodigy stats -l    # view stats and list all datasets
prodigy stats -ls  # view stats and list all datasets and sessions

We didn’t want to add too many arbitrary, Prodigy-specific commands for interacting with the database at this point, because that easily gets messy and we weren’t sure how many of them users would actually need. So for now, if you want to rename an existing dataset or change its description, you’d have to export it and re-add it under the new name:

prodigy db-out my_set /tmp
prodigy db-in my_new_set /tmp/my_set.jsonl "Some description"
prodigy drop my_set  # optional: delete dataset
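The same export and re-add steps could also be scripted in Python. The following is only a minimal sketch: `copy_dataset` is a hypothetical helper, and it assumes the database object exposes `datasets`, `get_dataset`, `add_dataset`, `add_examples` and `drop_dataset`, as the object returned by `prodigy.components.db.connect()` does in Prodigy 1.x. Check these names against your installed version before relying on them.

```python
def copy_dataset(db, old_name, new_name, description=None, drop_old=False):
    """Copy all examples from old_name into a new dataset called new_name.

    `db` is assumed to behave like Prodigy's database object (an assumption,
    see the note above). Returns the number of examples copied.
    """
    if old_name not in db.datasets:
        raise ValueError(f"Dataset not found: {old_name}")
    if new_name in db.datasets:
        raise ValueError(f"Dataset already exists: {new_name}")
    examples = db.get_dataset(old_name)
    # The description is stored in the dataset meta, like the third
    # argument of `prodigy db-in`.
    meta = {"description": description} if description else {}
    db.add_dataset(new_name, meta)
    db.add_examples(examples, datasets=[new_name])
    if drop_old:
        # Only drop the old set after the copy has been written.
        db.drop_dataset(old_name)
    return len(examples)


# Hypothetical usage, assuming Prodigy is installed:
# from prodigy.components.db import connect
# db = connect()
# copy_dataset(db, "my_set", "my_new_set", "Some description")
```

Leaving `drop_old` off by default mirrors the "optional" drop step above, so the original annotations stay around until you have checked the new dataset.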

You can also preview an existing dataset on the command line using ner.print-dataset and textcat.print-dataset. If the dataset is large, I’d recommend piping the output to less so you can navigate through it (with the -r flag to make sure the colors are displayed correctly):

prodigy ner.print-dataset news_headlines | less -r

OK, great. This should be all I need for now. Thanks!


It almost feels like this should be part of the docs. Is there a place that lists everything you can do with the command line?

The PRODIGY_README.html, which is available for download with Prodigy, has all those commands grouped into "Other recipes and commands" in the "Recipes" section of the table of contents. There's also an overview online on the recipes page: the examples focus more on the annotation recipes, but the database commands are also listed towards the end.

The thing is, there's not always a perfectly clear distinction between "commands" and "recipes" – technically, all Prodigy commands are also recipes, since they're wrapped in the @recipe decorator and use the same CLI style. You can also easily add your own Prodigy commands via custom recipes. (If a recipe doesn't return a dictionary of components, Prodigy will just execute the function and not start the server – so you could easily write your own database commands if there's extra functionality you need.)

Is there a way to drop certain annotations from a database without dropping the entire dataset?

How are you identifying the examples? Do you have their hashes, or are you using some custom logic for filtering?

There's no direct command for this because specifying task IDs on the CLI typically isn't that useful. But you could do this by filtering the examples to remove the ones you want to drop, saving them to a new dataset, and dropping the old one. (If you want to use the same name for both datasets, don't forget to back up your previous annotations – otherwise, a small bug in your code between dropping the old set and re-adding the examples can cause you to lose the data.)
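As a rough sketch of that filtering step, the script below works on a JSONL file exported with `prodigy db-out` and writes a filtered copy that can be imported into a new dataset with `prodigy db-in`. The function name `filter_jsonl` and the file paths are hypothetical, and it assumes you identify the examples to drop by their `_task_hash` values; swap in your own condition if you filter on something else.

```python
import json
from pathlib import Path


def filter_jsonl(src, dst, drop_hashes, key="_task_hash"):
    """Copy examples from src to dst, skipping those whose hash is in drop_hashes.

    Returns a (kept, dropped) tuple of counts. The source file is left
    untouched, so it doubles as a backup of the original annotations.
    """
    drop_hashes = set(drop_hashes)
    kept = dropped = 0
    with Path(src).open(encoding="utf8") as fin, \
            Path(dst).open("w", encoding="utf8") as fout:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            example = json.loads(line)
            if example.get(key) in drop_hashes:
                dropped += 1
                continue
            fout.write(json.dumps(example) + "\n")
            kept += 1
    return kept, dropped


# Hypothetical usage after `prodigy db-out my_set /tmp`:
# filter_jsonl("/tmp/my_set.jsonl", "/tmp/my_set_filtered.jsonl", {12345, 67890})
# and then: prodigy db-in my_new_set /tmp/my_set_filtered.jsonl
```

Comparing the returned counts against what you expected to remove is a cheap sanity check before dropping the old dataset.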
