Customize the JSON format when saving annotations in the database?

  1. Is there a way to customize the JSON format when saving annotations in the database?

{"text": "this is an example of some_location.", "_input_hash": 79275707, "_task_hash": 565275264, "spans": [{"start": 23, "end": 35, "text": "some_location", "rank": 0, "label": "LOC", "score": 0.6479370919, "source": "en_core_web_lg", "input_hash": 79275707}], "meta": {"score": 0.6479370919}, "answer": "accept"}

I would like to customize the above format.

  1. How can I modify the default schema?

What exactly would you like to customise?

If you want to add properties, you can include them when you load in your data in JSONL format. As long as it’s valid JSON, your custom properties will be passed through and saved in the database with the annotations. For example:

{"text": "Some text", "custom_id": 123}

Internally, Prodigy uses the JSON format to communicate annotation tasks. Depending on the database you’re using, the data is then converted to the respective database fields and formats. If you want to export your data in a different format, you can always interact with the database directly, request the dataset and then export it:

from prodigy.components.db import connect

db = connect()  # uses the settings from your prodigy.json
examples = db.get_dataset('your_dataset_name')
# `examples` is a list of dictionaries in Prodigy's format – you can
# now convert it however you want, and save it out in any file format

You can find more details on the database methods in your PRODIGY_README.html.
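For instance, converting the exported examples into a simpler custom shape is just plain Python. Here's a minimal sketch – the input fields follow the annotation record shown earlier in this thread, while the flattened output shape ("text", "answer", "entities") is purely illustrative:

```python
# Sketch: flatten Prodigy-style examples into a simpler custom format.
# The input keys ("text", "spans", "answer") follow Prodigy's annotation
# format; the output shape here is just an illustrative custom schema.

def simplify_example(eg):
    """Extract the text, answer and span tuples from one annotation task."""
    return {
        "text": eg["text"],
        "answer": eg.get("answer"),
        "entities": [
            (span["start"], span["end"], span["label"])
            for span in eg.get("spans", [])
        ],
    }

example = {
    "text": "this is an example of some_location.",
    "spans": [{"start": 23, "end": 35, "text": "some_location",
               "label": "LOC", "score": 0.6479370919}],
    "answer": "accept",
}

print(simplify_example(example))
```

You could map a function like this over `db.get_dataset('your_dataset_name')` and then write the results out as JSONL, CSV or whatever your downstream process expects.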

When modifying the dataset, you usually want to keep a copy of the original data. That’s also the reason Prodigy prevents you from overwriting annotations directly before they’re saved. The original dataset should always reflect the exact data that came back from the annotator – otherwise, you can never be sure that the labels you’re training on are really what was annotated and it’s too easy to accidentally destroy data if there’s a bug in your code.

Thanks for your prompt response.

When using Postgres, Prodigy has a fixed schema with dataset, example and link tables and their corresponding column structure. However, I would like to extract some parts of that data and save them under my own custom database schema and format.

Is there a way to override Prodigy's database behaviour for saving annotations and align it with a custom schema and format?

If not, what would you suggest? (Basically, I want to save the completed annotations in a different format under a custom database schema, by extracting the spans and text from the output Prodigy produces for saved annotations.)

Ah okay, thanks for the clarification! There are two main options:

1. Create a custom Database class

The first one would be to plug in your own Database class. You can find the detailed API documentation of the structure Prodigy expects in your PRODIGY_README.html. Your custom class needs to implement the same methods as Prodigy’s built-in class. You can then plug it into Prodigy via the 'db' setting returned by the recipe. For example:

return {
    'dataset': dataset,
    'db': YourCustomDB(),
    # other stuff
}
(In the upcoming version of Prodigy, you’ll also be able to wrap your custom loader as a Python package and expose it via the entry points. You can then simply set "db": "your_custom_db" in your prodigy.json and won’t have to customise any recipe.)
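To give you a rough idea of the shape, here's a hypothetical sketch of such a class backed by an in-memory dict. The method names below (`add_dataset`, `add_examples`, `get_dataset`) mirror the ones that come up in this thread, but they're not a complete or verified interface – the authoritative list of required methods is in your PRODIGY_README.html:

```python
# Hypothetical sketch of a custom Database class. The exact methods and
# signatures Prodigy expects are documented in PRODIGY_README.html – the
# names below are illustrative, not a complete interface.

class YourCustomDB:
    def __init__(self):
        # replace this dict with a connection to your own Postgres schema
        self.datasets = {}

    def add_dataset(self, name):
        """Create a dataset if it doesn't exist yet."""
        self.datasets.setdefault(name, [])

    def add_examples(self, examples, datasets):
        """Save a batch of examples to one or more datasets.

        This is where you could extract the text and spans and map them
        onto your own tables instead of storing the raw tasks.
        """
        for name in datasets:
            self.add_dataset(name)
            self.datasets[name].extend(examples)

    def get_dataset(self, name):
        """Return all examples in a dataset, or None if it doesn't exist."""
        return self.datasets.get(name)


db = YourCustomDB()
db.add_examples([{"text": "Some text", "answer": "accept"}],
                datasets=["test_dataset"])
```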

2. Send the answers to your database via the update callback

Alternatively, you could also send a copy of every batch of annotations to your database and then store it however you like. Recipes can define an optional update method that is called with a list of annotations every time Prodigy receives a new batch.

# in your recipe
def update(answers):
    # do something with the annotated answers here
    ...

return {
    'dataset': dataset,
    'update': update,
    # etc.
}

In theory, you could also set 'db': False if you go for this approach. This will disable Prodigy’s built-in database. But it also means that you could lose data if something goes wrong – so you might still want to keep at least a local SQLite backup, just in case.
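Putting it together, an update callback that extracts just the text and spans from each batch could look roughly like this. Note that `save_to_custom_db` is a hypothetical stand-in for whatever actually writes rows into your own schema:

```python
# Sketch of an update callback that extracts text and spans from each
# batch of answers. save_to_custom_db is a hypothetical stand-in for
# whatever inserts rows into your custom database schema.

saved_rows = []

def save_to_custom_db(row):
    # replace with an INSERT into your own Postgres tables
    saved_rows.append(row)

def update(answers):
    """Called by Prodigy with each new batch of annotated tasks."""
    for eg in answers:
        if eg.get("answer") != "accept":
            continue  # skip rejected/ignored tasks
        for span in eg.get("spans", []):
            save_to_custom_db({
                "text": eg["text"],
                "start": span["start"],
                "end": span["end"],
                "label": span["label"],
            })

# simulate one incoming batch
update([{"text": "this is an example of some_location.",
         "answer": "accept",
         "spans": [{"start": 23, "end": 35, "label": "LOC"}]}])
```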