Extracting annotations from a database using a custom recipe

Hi.
We've been using a custom recipe to connect to our PostgreSQL database via the PostgresqlExtDatabase class from playhouse, so that we can pass the user/password/host in as environment variables and avoid pushing this information to the repository. Like so:

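import os
from playhouse.postgres_ext import PostgresqlExtDatabase

# Connection details come from environment variables, not the repository
db = PostgresqlExtDatabase(os.getenv('PRODIGY_DB_NAME'),
                           user=os.getenv('PRODIGY_DB_USER'),
                           password=os.getenv('PRODIGY_DB_PASSWORD'),
                           host=os.getenv('PRODIGY_DB_HOST'),
                           register_hstore=False)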

I was wondering if there's an equivalent to the prodigy db-out command that works from a custom recipe, so that we can keep connecting to our database this way. Or is there another recommended way to accomplish this?

Hi! When you pass your custom db into Prodigy's Database class, you get an instance of Prodigy's database object that has various methods for retrieving annotations and hashes, adding examples, adding datasets and so on. You can find the full API documentation in your PRODIGY_README.html.

If you want to get the annotations in a given dataset, you can do something like this:

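from prodigy.components.db import Database

# Wrap the existing peewee connection (db) in Prodigy's database object
prodigy_db = Database(db)
examples = prodigy_db.get_dataset("your_dataset_name")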

examples is a list of dicts that you can then save however you like. You might also find our little helper library srsly helpful, especially for writing JSONL. It's also what Prodigy uses under the hood.

import srsly
srsly.write_jsonl("/path/to/data.jsonl", examples)
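
If you'd like something closer to db-out that you can call from your own script or recipe, the two steps above can be wrapped in a small helper. This is only a rough sketch: export_dataset is a made-up name, it uses the standard library's json module instead of srsly so it runs without extra dependencies, and it assumes that a missing dataset comes back as None, so check that against the version you have installed.

```python
import json
from pathlib import Path


def export_dataset(prodigy_db, dataset_name, out_path):
    """Write every example in one dataset to a JSONL file.

    prodigy_db is assumed to be the object returned by Database(db),
    i.e. anything with a get_dataset(name) method that returns a list
    of dicts. export_dataset is a hypothetical helper, not part of Prodigy.
    """
    examples = prodigy_db.get_dataset(dataset_name)
    if examples is None:
        # Assumption: a dataset that doesn't exist is returned as None
        raise ValueError(f"Can't find dataset '{dataset_name}'")
    with Path(out_path).open("w", encoding="utf-8") as f:
        for eg in examples:
            # One JSON object per line is all that JSONL requires
            f.write(json.dumps(eg, ensure_ascii=False) + "\n")
    return len(examples)
```

You'd then call export_dataset(prodigy_db, "your_dataset_name", "/path/to/data.jsonl") with the prodigy_db object created above, and it returns the number of examples it wrote.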