Loading pre-annotated data

Hi. I would like to load a dataset that was pre-annotated outside of Prodigy. Is it possible to see the already selected label in the UI and then view what was changed?

Prodigy's input and output formats are identical, so you can always load in already annotated data, or generate the JSONL format programmatically. You can see an overview of the expected formats for the different interfaces here: https://prodi.gy/docs/api-interfaces

For example, if you're using the choice UI with multiple options, you can provide a list such as "accept": ["LABEL_A", "LABEL_B"] to pre-select options in the UI. This list is then updated if you make changes in the UI. If you want to preserve the original annotations, you can add them under a separate key in the JSON you send out, e.g. "orig_selection": [...]. This key is passed through with the data, so to find out whether the annotations have changed, you just need to compare accept and orig_selection.
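To make that concrete, here is a minimal sketch in plain Python of generating such tasks and checking them afterwards. The helper names make_task and selection_changed and the example texts are made up for illustration; only the "text", "options" and "accept" keys follow the choice interface format, and "orig_selection" is the custom pass-through key suggested above.

```python
import json


def make_task(text, options, selected):
    """Build one choice task with the existing labels pre-selected.

    make_task is a hypothetical helper, not part of Prodigy.
    """
    return {
        "text": text,
        "options": [{"id": label, "text": label} for label in options],
        "accept": list(selected),          # pre-selected in the UI
        "orig_selection": list(selected),  # custom key, passed through untouched
    }


def selection_changed(example):
    """Return True if the annotator changed the pre-selected labels."""
    accepted = set(example.get("accept", []))
    original = set(example.get("orig_selection", []))
    return accepted != original


# Placeholder data: two tasks serialized as JSONL, one JSON object per line.
tasks = [
    make_task("First example text", ["LABEL_A", "LABEL_B"], ["LABEL_A"]),
    make_task("Second example text", ["LABEL_A", "LABEL_B"], ["LABEL_A", "LABEL_B"]),
]
jsonl = "\n".join(json.dumps(task) for task in tasks)
```

After annotating, running selection_changed over the exported examples gives you the subset where the labels were corrected.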


@ines I have a follow-up question here. I loaded some pre-annotated data that I want to re-annotate and do some error analysis on using this command, which seemed to load the data.

(prodigy) cheyannebaird@Cheyannes-MacBook-Pro:~/.prodigy$ prodigy db-in hbm_error_analysis /path_to_data/data.jsonl
✔ Imported 170 annotations to 'hbm_error_analysis' (session 2022-10-17_08-23-40) in database PostgreSQL

How do I then load this into the UI with the highlighted annotated label?

How I normally load the data locally:

PRODIGY_ALLOWED_SESSIONS=cheyanne PRODIGY_LOGGING=verbose prodigy recipe-name hbm_error_analysis /path_to_data/data.jsonl -F /path_to_recipe/my_recipe.py


A small detail: there's no need to @ a forum member for support. We all round robin the issues and cannot guarantee that the same person picks up the ticket/thread. Also: apologies for the delay! Life with a newborn sure is hectic.

Am I correct in assuming that you're using a custom recipe here?

If so, your recipe-name recipe currently takes a reference to the /path_to_data/data.jsonl file. Inside your recipe, I'm assuming it uses something like the following to load the examples:

from prodigy.components.loaders import JSONL

stream = JSONL(data_path)  # data_path: the file path passed to the recipe

But if you want to pass in the name of a dataset that already exists, you could do something like:

from prodigy.components.db import connect

dataset_name = "hbm_error_analysis"
db = connect()  # uses settings from prodigy.json
stream = db.get_dataset(dataset_name)  # retrieve the examples in the dataset
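If you also want to keep track of what was stored before you re-annotate, a rough sketch of a wrapper could look like the one below. The function stream_from_dataset and the FakeDB stand-in are hypothetical and only there so the snippet runs without a database; in a recipe you would pass in the object returned by connect() instead. The "orig_selection" key is again just a custom key, and the sketch assumes the stored examples keep their labels under "accept".

```python
import copy


def stream_from_dataset(db, dataset_name):
    """Yield a copy of each stored example, remembering its stored selection.

    db can be any object with a get_dataset(name) method.
    """
    # Fall back to an empty list in case nothing comes back for this name.
    examples = db.get_dataset(dataset_name) or []
    for eg in examples:
        task = copy.deepcopy(eg)  # leave the stored example untouched
        # Only set the custom key if the example does not already carry one.
        task.setdefault("orig_selection", list(task.get("accept", [])))
        yield task


class FakeDB:
    """Stand-in for a database connection, for illustration only."""

    def get_dataset(self, name):
        return [{"text": "Some text", "accept": ["LABEL_A"], "answer": "accept"}]


stream = list(stream_from_dataset(FakeDB(), "hbm_error_analysis"))
```

Comparing accept and orig_selection on the re-annotated examples then shows which labels were corrected.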

Would this work?