Dataset management

My question is about dataset management in Prodigy. We have multiple customers and multiple datasets per customer, stored in a separate database for each customer. Whenever our annotators work on a specific customer, we end up creating multiple datasets: some are intermediate, some are merged and some are for testing. Usually only one or two of them are viable for training. We currently track the viable datasets in a separate spreadsheet. An alternative we are considering is an additional table per customer where a user can manually set a flag for each viable dataset.
I am wondering whether it is possible to do this within the Prodigy database itself, for example by setting a flag in the dataset metadata that identifies the viable datasets?

Hi! This sounds like a reasonable workflow :slightly_smiling_face: Just to make sure I understand the exact requirement: you basically want to attach meta information to existing datasets, and that meta information may change, so you need to update it? For example, whether the current set is a viable dataset?

By default, the datasets table in the database does have a meta field that can contain any JSON-serializable meta information. It can be accessed via the Database.get_meta method but we're currently not exposing a method to update the dataset meta. However, you could implement this yourself in a little helper recipe/script:

from prodigy.components.db import Dataset, connect
import json

def update_dataset_meta(name: str, meta: dict):
    dataset = Dataset.get(Dataset.name == name)
    dataset.update(meta=json.dumps(meta)).execute()  # this overrides the meta dict! 

You can then do things like this:

db = connect()
db.add_dataset("my_cool_dataset", meta={"is_viable": False})
print(db.get_meta("my_cool_dataset"))  # {"is_viable": False, "created": ..}
update_dataset_meta("my_cool_dataset", {"is_viable": True})
print(db.get_meta("my_cool_dataset"))  # {"is_viable": True, "created": ..}
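Since the helper overrides the whole meta dict, here is a rough sketch of two more helpers that might be useful: one that merges new keys into the existing meta before you write it back, and one that lists the datasets currently flagged as viable. The names merged_meta and viable_datasets are made up for this example, and it assumes that db.datasets returns the dataset names and that db.get_meta returns a dict (or nothing if no meta is found), so treat it as a starting point rather than a tested recipe:

```python
from typing import Any, Dict, List


def merged_meta(db: Any, name: str, updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return the dataset's current meta with `updates` applied on top."""
    current = db.get_meta(name) or {}  # guard in case no meta comes back
    return {**current, **updates}


def viable_datasets(db: Any, flag: str = "is_viable") -> List[str]:
    """Return the names of all datasets whose meta has a truthy `flag`."""
    # Assumption: `db.datasets` yields the names of all datasets in the database.
    return [name for name in db.datasets if (db.get_meta(name) or {}).get(flag)]
```

You could then call update_dataset_meta("my_cool_dataset", merged_meta(db, "my_cool_dataset", {"is_viable": True})) to flip the flag without dropping any other keys, and viable_datasets(db) to get the list you currently keep in the spreadsheet.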

For some reason, the answer above changed the meta data of all my datasets. I amended the method as follows to get it to work:


def update_dataset_meta(name: str, meta: dict):
    dataset = Dataset.get(Dataset.name == name)
    # Restrict the update to this one dataset (this still overrides its meta dict!)
    dataset.update(meta=json.dumps(meta)).where(
        Dataset.id == dataset.id
    ).execute()
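
A likely explanation, as far as I can tell: Dataset looks like a peewee model, and in peewee update() builds a query against the whole table even when it is called on an instance, so without a where() clause every row gets the new meta. The sketch below reproduces the same effect with plain sqlite3 from the standard library. The dataset table here is a hypothetical stand-in with made-up rows, not Prodigy's actual schema:

```python
import json
import sqlite3

# Hypothetical stand-in for the datasets table: a name and a JSON meta column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE, meta TEXT)")
conn.executemany(
    "INSERT INTO dataset (name, meta) VALUES (?, ?)",
    [
        ("set_a", json.dumps({"is_viable": False})),
        ("set_b", json.dumps({"is_viable": False})),
    ],
)


def read_all(connection: sqlite3.Connection) -> dict:
    """Return {dataset name: parsed meta} for every row."""
    rows = connection.execute("SELECT name, meta FROM dataset ORDER BY name").fetchall()
    return {name: json.loads(meta) for name, meta in rows}


new_meta = json.dumps({"is_viable": True})

# With a WHERE clause, only the named dataset changes.
conn.execute("UPDATE dataset SET meta = ? WHERE name = ?", (new_meta, "set_a"))
scoped = read_all(conn)

# Without a WHERE clause, every row is overwritten.
conn.execute("UPDATE dataset SET meta = ?", (new_meta,))
unscoped = read_all(conn)
```

After the first update only set_a is flagged as viable; after the second, set_b is flagged too, which matches what I saw with the original helper.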
