Dataset management

nomiizz · January 13, 2021, 9:50pm

My question is about dataset management in Prodigy. We have multiple customers and multiple datasets per customer. The datasets are stored in separate databases, one for each customer. Whenever our annotators work on a specific customer, we end up creating multiple datasets. Some of them are intermediary, others are merged and some are for testing. However, we usually have only one or two viable datasets that can be used for training. This information currently has to be stored in a separate spreadsheet where we track the viable datasets. Another option we are considering is to create an additional table per customer where this information can be stored and a user can manually set a flag for each viable dataset.
I am wondering if it is possible to do this within the Prodigy database itself? For example, by setting a flag in the dataset metadata that identifies the viable datasets.

ines · January 15, 2021, 2:00am

Hi! This sounds like a reasonable workflow. Just to make sure I understand the exact requirement: you basically want to attach meta information to existing datasets, and that meta information may change, so you need to update it? For example, whether the current set is a viable dataset?

By default, the datasets table in the database does have a meta field that can contain any JSON-serializable meta information. It can be accessed via the Database.get_meta method but we're currently not exposing a method to update the dataset meta. However, you could implement this yourself in a little helper recipe/script:

from prodigy.components.db import Dataset, connect
import json

def update_dataset_meta(name: str, meta: dict):
    dataset = Dataset.get(Dataset.name == name)
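    # Caution: this overrides the whole meta dict, and without a .where() clause
    # the update is not limited to this dataset (see the amended version below).
    dataset.update(meta=json.dumps(meta)).execute()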

You can then do things like this:

db = connect()
db.add_dataset("my_cool_dataset", meta={"is_viable": False})
print(db.get_meta("my_cool_dataset"))  # {"is_viable": False, "created": ..}
update_dataset_meta("my_cool_dataset", {"is_viable": True})
print(db.get_meta("my_cool_dataset"))  # {"is_viable": True, "created": ..}
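Because the helper replaces the stored meta dict wholesale, keys such as "created" are lost unless you pass them in again. A minimal sketch of one way around that, plus a filter for the original "which datasets are viable?" question. The names merge_meta and viable_datasets are hypothetical and not part of Prodigy; get_meta stands for any callable that returns a dataset's meta dict, such as db.get_meta:

```python
from typing import Callable, Dict, Iterable, List


def merge_meta(existing: dict, updates: dict) -> dict:
    """Return a new dict: the existing meta with the updates applied on top."""
    return {**existing, **updates}


def viable_datasets(
    names: Iterable[str], get_meta: Callable[[str], dict]
) -> List[str]:
    """Return the dataset names whose meta has a truthy "is_viable" flag."""
    return [name for name in names if (get_meta(name) or {}).get("is_viable")]


# Stand-in for the database: dataset name -> meta dict (hypothetical values).
fake_db: Dict[str, dict] = {
    "customer_a_final": {"is_viable": False, "created": 1610574600},
    "customer_a_test": {"created": 1610574700},
}

# Flip the flag while keeping the other keys intact.
fake_db["customer_a_final"] = merge_meta(
    fake_db["customer_a_final"], {"is_viable": True}
)
print(viable_datasets(fake_db, fake_db.get))
```

With a real database you would read the current meta with db.get_meta(name), pass the merged dict to update_dataset_meta, and supply your own list of dataset names.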

gustav · October 19, 2021, 2:28pm

For some reason the answer above changed the metadata of all my datasets. I amended the function as follows to get it to work:


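import json
from prodigy.components.db import Dataset

def update_dataset_meta(name: str, meta: dict):
    dataset = Dataset.get(Dataset.name == name)
    # Restrict the update to this dataset's row; this still overrides its meta dict.
    dataset.update(meta=json.dumps(meta)).where(
        Dataset.id == dataset.id
    ).execute()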
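A likely explanation, assuming Dataset is a peewee model: update() builds a table-level UPDATE query even when it is called on an instance, so without .where() nothing restricts it to one row. The sketch below reproduces the same effect with plain sqlite3. The dataset table and its columns are a simplified stand-in for illustration, not Prodigy's actual schema:

```python
import json
import sqlite3

# In-memory stand-in table (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, meta TEXT)")
conn.executemany(
    "INSERT INTO dataset (name, meta) VALUES (?, ?)",
    [
        ("customer_a_final", json.dumps({"is_viable": False})),
        ("customer_a_test", json.dumps({"is_viable": False})),
    ],
)


def set_meta_one_row(name: str, meta: dict) -> None:
    """UPDATE with a WHERE clause: only the named dataset changes."""
    conn.execute(
        "UPDATE dataset SET meta = ? WHERE name = ?", (json.dumps(meta), name)
    )


def set_meta_all_rows(meta: dict) -> None:
    """UPDATE without a WHERE clause: every row in the table changes."""
    conn.execute("UPDATE dataset SET meta = ?", (json.dumps(meta),))


def read_meta() -> dict:
    """Return {dataset name: decoded meta dict} for all rows."""
    rows = conn.execute("SELECT name, meta FROM dataset")
    return {name: json.loads(meta) for name, meta in rows}


set_meta_one_row("customer_a_final", {"is_viable": True})
scoped = read_meta()    # only customer_a_final is flagged

set_meta_all_rows({"is_viable": True})
unscoped = read_meta()  # both datasets are now flagged
```

Comparing scoped and unscoped shows why the version with .where() touches a single dataset while the original touched all of them.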
