Labeling existing data in Postgres table

We have a large dataset of text and we want to attach a data label to every data record. All ML pipelines are reading from the database (Postgres) and we run processes to convert the Postgres data into tf records, etc.

We are looking into Prodigy as our standard annotation tool. Can we configure Prodigy to read from specific tables and write the human label to a given label field?

All examples I saw relied on the export of the data to the jsonl format, which is then read by Prodigy. I also found examples to configure Prodigy to use Postgres for the data storage, but I couldn't figure out how to configure Prodigy read/write from a specific table.

Is it generally possible to read data straight from a Postgres table without the export step?

Hi! What you describe should definitely be no problem to implement :slightly_smiling_face: Prodigy streams are Python generators that yield dictionaries – so if you can load your data in Python, you can use it in Prodigy. Here's a super minimal example that shows the concept:

def stream():
    texts = load_your_texts_from_postgres()
    for text in texts:
        # You can also add custom properties like a database ID here
        yield {"text": text}

The above stream would then be returned by your custom recipe. Prodigy can also read streams from standard input, so if you prefer that, you could write a script that outputs the records and then pipe it forward to the recipe (see here for an example). Once you're getting more serious about this, you can even register a custom loader via entry points, so you can specify something like --loader custom-postgres on the command line.

Saving to Postgres is supported out-of-the-box – although, you might want to customise this a little to fix into your existing workflow. When Prodigy saves annotations to the database, it'll store each annotated example as a JSON blob. This makes sense for the default use case, because we can't make any assumptions about the data in the task dict. However, if you know what your fields are and what you're annotating, you might want to customise that as well.

Recipes can return an update callback that's executed whenever new annotations are sent back to the server. It's typically used to update models in the loop etc., but you can also use it to save examples to your database:

def update(examples):
    # This is called when Prodigy receives a batch of annotated examples
    print(f"Saved {len(examples)} examples")

The incoming examples are also dicts in Prodigy's format. For instance, if you're annotating named entities, they'd contain a list of "spans", which you could then add to the respective table in your database. Custom properties are passed through, so when you stream in the examples, you can attach an ID that then lets you associate the example with a record in your database when you get it back later.

If you need more flexibility, you can also pass in a fully custom Database class that implements a few pre-defined methods and tells Prodigy how to create datasets and save examples (see here for details).

1 Like

Hi Ines,

Thank you for your detailed and fast reply. We'll most likely will get a Prodigy license and integrate it into our pipelines.

Viele Grüße,