How do tables map to datasets in prodigy DB?

westofpluto · December 12, 2019, 9:43pm

I just purchased Prodigy and am working through the documentation. I ran the simple DB test script (found in the readme file) that tests out the database. I used all defaults so it created prodigy.db in my default home folder ~/.prodigy. The script is:

from prodigy.components.db import connect
db = connect()
db.add_dataset('test_dataset')
assert 'test_dataset' in db
examples = [{'text': 'hello world', '_task_hash': 123, '_input_hash': 456}]
db.add_examples(examples, ['test_dataset'])
dataset = db.get_dataset('test_dataset')
assert len(dataset) == 1

The first time I run this it works fine. The second time it fails, I think because the dataset now has length 2. But when I open prodigy.db with the sqlite3 command-line tool and run the .dump command, all I see is this:

sqlite> .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
COMMIT;
sqlite>

So where are the examples being stored? How does a "dataset" map to tables? There seem to be no tables in the sqlite3 database. Is a db.save() command being performed automatically?

Please explain in more detail what is happening and where the data is being saved because I don't see it in my database.

westofpluto · December 12, 2019, 10:03pm

OK, I figured it out. I used the wrong sqlite3 command: "sqlite3 prodigy" instead of "sqlite3 prodigy.db". The first command just creates an empty database in a file called "prodigy". The second opens the actual database in the file prodigy.db. The db.add_dataset('test_dataset') command creates a table called 'dataset' (if it doesn't exist yet) and adds a record with several fields, one being the name of the dataset (test_dataset). The db.add_examples(...) command creates a table called 'example' (if it doesn't already exist) and then adds the example as a record in that table. At some point in this process, it also adds a record to a table called 'link' that has foreign keys to both 'dataset' and 'example' to show that these examples belong to this dataset.
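For anyone who wants to check this from Python instead of the sqlite3 shell, here is a minimal sketch using only the standard library. It is not part of Prodigy: `list_tables` is a hypothetical helper name, and the only thing it relies on is SQLite's built-in `sqlite_master` catalog. It refuses to open a path that doesn't exist, because `sqlite3.connect` (like the sqlite3 shell) would otherwise silently create an empty database there.

```python
import os
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path.

    Hypothetical helper, not part of Prodigy.
    """
    # Guard against a mistyped path: connecting to a missing file would
    # create a new, empty database instead of raising an error.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

With the default setup described above, calling `list_tables(os.path.expanduser("~/.prodigy/prodigy.db"))` should list the tables Prodigy has created, while a mistyped path such as `~/.prodigy/prodigy` raises `FileNotFoundError` rather than producing an empty file.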

I would have liked to see this level of detail in the documentation, but at least I know how it works now.

ines · December 13, 2019, 9:45am

Glad you found the answer! Adding the table info more prominently to the docs is a good idea. At the moment the tables that are created are only really mentioned in the section on permissions. If you haven't seen it yet, you can find the API docs of the Database class in your PRODIGY_README.html.

Here are the tables added by Prodigy and what's in them (will also copy that info over to the docs later):

| Table | Description |
| --- | --- |
| `Dataset` | The dataset IDs and dataset meta. |
| `Example` | The individual annotation examples. Each example is only added once, so if you add the same annotation to multiple datasets, it'll only have one record here. |
| `Link` | Example IDs linked to datasets. This is how Prodigy knows which examples belong to which datasets. |
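To make the relationship between the three tables concrete, here is a rough, self-contained sketch using an in-memory SQLite database. It is a simplified illustration, not Prodigy's actual schema: the column names (`name`, `content`, `dataset_id`, `example_id`) and the two helper functions are assumptions made up for this example. It shows one example being stored once and linked to two datasets.

```python
import json
import sqlite3

# Simplified, assumed schema: real column names in Prodigy may differ.
SCHEMA = """
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE example (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE link (
    dataset_id INTEGER REFERENCES dataset(id),
    example_id INTEGER REFERENCES example(id)
);
"""


def add_examples(conn, examples, dataset_names):
    """Store each example once and link it to every named dataset."""
    dataset_ids = []
    for name in dataset_names:
        # Create the dataset row if it is not there yet.
        conn.execute("INSERT OR IGNORE INTO dataset (name) VALUES (?)", (name,))
        row = conn.execute(
            "SELECT id FROM dataset WHERE name = ?", (name,)
        ).fetchone()
        dataset_ids.append(row[0])
    for example in examples:
        # One row in `example` per example in this call...
        cur = conn.execute(
            "INSERT INTO example (content) VALUES (?)", (json.dumps(example),)
        )
        # ...and one row in `link` per dataset it belongs to.
        for dataset_id in dataset_ids:
            conn.execute(
                "INSERT INTO link (dataset_id, example_id) VALUES (?, ?)",
                (dataset_id, cur.lastrowid),
            )


def get_dataset(conn, name):
    """Return the examples linked to the named dataset."""
    rows = conn.execute(
        "SELECT example.content FROM example "
        "JOIN link ON link.example_id = example.id "
        "JOIN dataset ON dataset.id = link.dataset_id "
        "WHERE dataset.name = ? ORDER BY example.id",
        (name,),
    ).fetchall()
    return [json.loads(content) for (content,) in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
add_examples(conn, [{"text": "hello world"}], ["dataset_a", "dataset_b"])
```

After this runs, the `example` table holds a single row and the `link` table holds two, one per dataset, so `get_dataset(conn, "dataset_a")` and `get_dataset(conn, "dataset_b")` both return the same example. Note that this sketch inserts a new `example` row on every call, which matches the dataset growing to length 2 when the test script above is run twice.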
