Labeling existing data in Postgres table

Hi! What you describe should definitely be no problem to implement :slightly_smiling_face: Prodigy streams are Python generators that yield dictionaries – so if you can load your data in Python, you can use it in Prodigy. Here's a super minimal example that shows the concept:

def stream():
    texts = load_your_texts_from_postgres()
    for text in texts:
        # You can also add custom properties like a database ID here
        yield {"text": text}

The above stream would then be returned by your custom recipe. Prodigy can also read streams from standard input, so if you prefer that, you could write a script that outputs the records and then pipe it forward to the recipe (see here for an example). Once you're getting more serious about this, you can even register a custom loader via entry points, so you can specify something like --loader custom-postgres on the command line.

Saving to Postgres is supported out-of-the-box – although, you might want to customise this a little to fix into your existing workflow. When Prodigy saves annotations to the database, it'll store each annotated example as a JSON blob. This makes sense for the default use case, because we can't make any assumptions about the data in the task dict. However, if you know what your fields are and what you're annotating, you might want to customise that as well.

Recipes can return an update callback that's executed whenever new annotations are sent back to the server. It's typically used to update models in the loop etc., but you can also use it to save examples to your database:

def update(examples):
    # This is called when Prodigy receives a batch of annotated examples
    add_annotations_to_postgres(examples)
    print(f"Saved {len(examples)} examples")

The incoming examples are also dicts in Prodigy's format. For instance, if you're annotating named entities, they'd contain a list of "spans", which you could then add to the respective table in your database. Custom properties are passed through, so when you stream in the examples, you can attach an ID that then lets you associate the example with a record in your database when you get it back later.

If you need more flexibility, you can also pass in a fully custom Database class that implements a few pre-defined methods and tells Prodigy how to create datasets and save examples (see here for details).

1 Like