Prodigy uses the peewee package to manage the database integration, which should hopefully give you a lot of flexibility in terms of setup and debugging. (It also allows custom configuration, in case you need it.)
If I’m not mistaken, Prodigy only uses the basic operations, so once the tables are created, you should be able to get by with SELECT, INSERT, UPDATE and DELETE. You can probably even leave out DELETE, since it’s only ever used if you run the prodigy drop command to delete datasets. Prodigy needs the tables Dataset, Example and Link (and will try to CREATE them if they don’t exist).
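In case it helps to see those statements side by side, here's a rough sketch of the four basic operations plus the one-time CREATE. It uses Python's built-in sqlite3 with an in-memory database instead of your real setup, and the table layout is made up for illustration (it's not Prodigy's actual schema), so treat it as a way to reason about permissions rather than as what Prodigy runs internally.

```python
import sqlite3

# In-memory SQLite stands in for the real database. The single-table layout
# below is hypothetical and is NOT Prodigy's actual schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE is only needed once, when the table doesn't exist yet.
cur.execute(
    "CREATE TABLE IF NOT EXISTS dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
)

# INSERT: add a dataset row.
cur.execute("INSERT INTO dataset (name) VALUES (?)", ("test_dataset",))

# SELECT: read the dataset names back.
names = [row[0] for row in cur.execute("SELECT name FROM dataset")]

# UPDATE: rename the row.
cur.execute(
    "UPDATE dataset SET name = ? WHERE name = ?", ("renamed", "test_dataset")
)
renamed = [row[0] for row in cur.execute("SELECT name FROM dataset")]

# DELETE: only needed for the equivalent of dropping a dataset.
cur.execute("DELETE FROM dataset WHERE name = ?", ("renamed",))
remaining = cur.execute("SELECT COUNT(*) FROM dataset").fetchone()[0]

conn.commit()
conn.close()
```

If your database user can run the equivalent of everything except the DELETE step, that should match the "leave out DELETE" setup described above.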
In your PRODIGY_README.html, you can find an overview of the available database methods. So in order to test the database connection, you can also write a little script that performs the most important operations:
from prodigy.components.db import connect
db = connect() # uses the settings in your prodigy.json
db.add_dataset('test_dataset') # add a dataset
# note: the next two checks assume the database had no datasets before this script ran
assert len(db) == 1
assert len(db.datasets) == 1
assert 'test_dataset' in db
print(db.datasets) # ['test_dataset']
examples = [{'text': 'hello world', '_task_hash': 123, '_input_hash': 456}]
db.add_examples(examples, ['test_dataset']) # add examples to the dataset
dataset = db.get_dataset('test_dataset') # retrieve a dataset
assert len(dataset) == 1
Btw, if you’re using the MySQLdb driver, you might have to use "name", "dbname" or "database" (instead of "db") to specify the database name in your prodigy.json. See this thread for more details – this will be fixed in the next release.
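To make the key swap concrete, here's a small sketch that prints a MySQL settings block with a different key for the database name. The overall layout ("db", "db_settings", "mysql") is how I remember the Prodigy config, so double-check it against your README; the helper mysql_config and all the values (host, user, password, database name) are hypothetical placeholders.

```python
import json


def mysql_config(db_name_key="db"):
    """Build a prodigy.json-style dict.

    db_name_key is the key that holds the database name. Which key works
    ("db", "name", "dbname" or "database") depends on your MySQL driver.
    All values below are placeholders.
    """
    return {
        "db": "mysql",
        "db_settings": {
            "mysql": {
                "host": "localhost",
                "user": "username",
                "passwd": "xxx",
                db_name_key: "prodigy",
            }
        },
    }


# Try "database" in place of the default "db" key.
config = mysql_config("database")
print(json.dumps(config, indent=2))
```

You can paste the printed JSON into your prodigy.json and try each key in turn until the connection works.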