Dataset download from web interface

Hi, I'm relatively new to Prodigy.

Someone at my company is doing annotations for me, but they are non-technical and cannot really interact with Python. I've dockerized the spans.manual recipe to provide a simple web interface. For some technical reasons, my company isn't able to set up the container in the cloud with an external database at this time. I can write a small script that the user can use to launch the Docker container on their machine, but I have no way to access the annotations. I would like to keep everything containerized and avoid setting up databases on their local machine for the time being.

Is there a mechanism for modifying the spans.manual recipe to allow the Prodigy dataset to be downloaded from the web interface?

I was going to look into the recipe to see if it could be modified, but I don't see it in the recipe repo. Is this recipe available for modification?

Welcome to the forum @FourthPartyAI :wave: ,

Adding a download button to the UI is definitely possible via Prodigy custom events.
In a nutshell, you could add a download button to the UI and hook it up to a custom JavaScript event that calls the Database API (concretely db.get_dataset_examples) and saves the result to disk.
Please see the docs on custom interfaces with HTML, CSS and JavaScript for some examples of similar UI extensions.

That said, modifying the UI is not necessarily the simplest solution for what you need, imo. You could also just add a background script to your Docker container that periodically saves the dataset to disk. You could leverage the Prodigy database CLI for this. The background script could look something like this (assuming you've mounted a volume at /app/backups, the dataset name is passed as the first argument, and the sleep interval can optionally be overridden via a BACKUP_INTERVAL env var):

#!/bin/bash

if [ -z "$1" ]; then
    echo "Error: Dataset name must be provided"
    echo "Usage: $0 <dataset_name>"
    exit 1
fi

DATASET_NAME="$1"

while true; do
    echo "Backing up dataset: $DATASET_NAME"

    # Create a timestamp for the backup file name
    timestamp=$(date +%Y%m%d_%H%M%S)

    # Export the dataset as JSONL
    prodigy db-out "$DATASET_NAME" > "/app/backups/${DATASET_NAME}_${timestamp}.jsonl"

    echo "Backup completed at $(date)"

    # Wait for the configured interval in seconds (default: 60)
    sleep "${BACKUP_INTERVAL:-60}"
done
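
To get at the backups from outside the container, you'd mount a host directory at /app/backups when starting it. Here's a rough sketch of a launch command; the image name, host path and interval are just placeholders:

# Hypothetical launch command: the image name and host path are placeholders.
# The bind mount exposes the periodic JSONL backups on the host machine,
# and BACKUP_INTERVAL (in seconds) overrides the script's default interval.
docker run -d \
  -p 8080:8080 \
  -v "$(pwd)/prodigy_backups:/app/backups" \
  -e BACKUP_INTERVAL=3600 \
  my-prodigy-image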

Alternatively, you could implement a super simple FastAPI app with an endpoint that calls the db.get_dataset_examples method mentioned above and lets the user download the file.

# app.py
from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse, HTMLResponse
from prodigy.components.db import connect
import json
from io import StringIO

app = FastAPI()

@app.get("/datasets/{dataset_name}")
async def export_dataset(dataset_name: str):
    """Export a specific dataset as JSONL"""
    try:
        db = connect()
        examples = db.get_dataset_examples(dataset_name)
        output = StringIO()
        for example in examples:
            output.write(json.dumps(example) + '\n')
        output.seek(0)
        return StreamingResponse(
            iter([output.getvalue()]),
            media_type="application/jsonl",
            headers={"Content-Disposition": f"attachment; filename={dataset_name}.jsonl"}
        )
    except Exception as e:
        raise HTTPException(status_code=404, detail=f"Dataset '{dataset_name}' not found")

@app.get("/datasets", response_class=HTMLResponse)
async def list_datasets():
    """Show list of datasets with download links"""
    db = connect()
    datasets = db.datasets
    
    links = "\n".join([
        f'<li><a href="/datasets/{dataset}">{dataset}</a></li>'
        for dataset in sorted(datasets)
    ])
    
    return f"""
    <!DOCTYPE html>
    <html>
        <head>
            <title>Prodigy Datasets</title>
            <style>
                body {{ font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 0 1rem; }}
                li {{ margin: 0.5rem 0; }}
                a {{ color: #2563eb; text-decoration: none; }}
                a:hover {{ text-decoration: underline; }}
            </style>
        </head>
        <body>
            <ul>
                {links}
            </ul>
        </body>
    </html>
    """

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
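
If you want the download endpoint to live in the same container as the annotation server, a minimal entrypoint could start both processes. This is only a sketch; the dataset name, model, source file and labels are placeholders:

#!/bin/bash
# entrypoint.sh (sketch): start the download API in the background,
# then run the Prodigy annotation server in the foreground (default port 8080).
# Dataset name, model, source file and labels below are placeholders.
uvicorn app:app --host 0.0.0.0 --port 8000 &
exec prodigy spans.manual my_dataset blank:en /app/data.jsonl --label ORG,PRODUCT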

Now when your colleague accesses http://localhost:8000/datasets, they should see a list of Prodigy datasets that can be clicked to download.

Finally, you can access the recipe source code in your Prodigy installation path (you can run python -m prodigy stats to get that path), inside the recipes folder.
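
For example (the exact layout inside the package may differ slightly between versions):

# Print Prodigy meta information; the Location entry is the installation path
python -m prodigy stats
# The built-in recipes live in the recipes folder inside that path,
# e.g. <Location>/prodigy/recipes/spans.py for spans.manual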