Welcome to the forum @FourthPartyAI,
Adding a download button to the UI is definitely possible via Prodigy custom events.
In a nutshell, you could add a download button to the UI and hook it up to a custom JavaScript event that calls the Database API (concretely, `db.get_dataset_examples`) and saves the result to disk.
Please see the docs on custom interfaces with HTML, CSS and JavaScript for examples of similar UI extensions.
That said, modifying the UI is not necessarily the simplest solution for what you need, imo. You could also just add a background script to your Docker container that periodically saves the dataset to disk, leveraging the Prodigy database CLI. Assuming you mounted a volume at `/app/backups`, pass the dataset name as the first argument (with an optional `BACKUP_INTERVAL` env var, in seconds), the background script could look something like this:
```bash
#!/bin/bash
if [ -z "$1" ]; then
    echo "Error: Dataset name must be provided"
    echo "Usage: $0 <dataset_name>"
    exit 1
fi

DATASET_NAME="$1"

while true; do
    echo "Backing up dataset: $DATASET_NAME"
    # Timestamp for the backup filename
    timestamp=$(date +%Y%m%d_%H%M%S)
    # Export the dataset (db-out writes JSONL to stdout)
    prodigy db-out "$DATASET_NAME" > "/app/backups/${DATASET_NAME}_${timestamp}.jsonl"
    echo "Backup completed at $(date)"
    # Wait for the specified interval in seconds (default: 1 hour)
    sleep "${BACKUP_INTERVAL:-3600}"
done
```
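In case the `${VAR:-default}` parameter expansion used for the sleep interval is unfamiliar: it falls back to the default only when the variable is unset or empty, so the same script works with or without the env var. A quick standalone illustration (plain shell, nothing Prodigy-specific):

```shell
# Default applies when the variable is unset or empty
unset BACKUP_INTERVAL
echo "${BACKUP_INTERVAL:-3600}"   # prints 3600 (the default)

# An exported value overrides the default
BACKUP_INTERVAL=900
echo "${BACKUP_INTERVAL:-3600}"   # prints 900
```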
Alternatively, you could implement a super simple FastAPI app with an endpoint that calls the `db.get_dataset_examples` method mentioned above and lets the user download the file:
```python
# app.py
import json
from io import StringIO

from fastapi import FastAPI, HTTPException
from fastapi.responses import StreamingResponse, HTMLResponse
from prodigy.components.db import connect

app = FastAPI()


@app.get("/datasets/{dataset_name}")
async def export_dataset(dataset_name: str):
    """Export a specific dataset as JSONL"""
    try:
        db = connect()
        examples = db.get_dataset_examples(dataset_name)
        output = StringIO()
        for example in examples:
            output.write(json.dumps(example) + "\n")
        output.seek(0)
        return StreamingResponse(
            iter([output.getvalue()]),
            media_type="application/jsonl",
            headers={"Content-Disposition": f"attachment; filename={dataset_name}.jsonl"},
        )
    except Exception:
        raise HTTPException(status_code=404, detail=f"Dataset '{dataset_name}' not found")


@app.get("/datasets", response_class=HTMLResponse)
async def list_datasets():
    """Show a list of datasets with download links"""
    db = connect()
    datasets = db.datasets
    links = "\n".join(
        f'<li><a href="/datasets/{dataset}">{dataset}</a></li>'
        for dataset in sorted(datasets)
    )
    return f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Prodigy Datasets</title>
        <style>
            body {{ font-family: system-ui; max-width: 800px; margin: 2rem auto; padding: 0 1rem; }}
            li {{ margin: 0.5rem 0; }}
            a {{ color: #2563eb; text-decoration: none; }}
            a:hover {{ text-decoration: underline; }}
        </style>
    </head>
    <body>
        <ul>
            {links}
        </ul>
    </body>
    </html>
    """


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Now when your colleague accesses http://localhost:8000/datasets, they should see a list of the Prodigy datasets, each of which can be clicked to download.
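If you want to sanity-check the JSONL serialization used by the endpoint without running Prodigy or FastAPI, the core logic is just one `json.dumps` per example. Here's a self-contained sketch (the sample records are made up, standing in for what `db.get_dataset_examples` returns):

```python
import json
from io import StringIO

# Dummy annotations standing in for db.get_dataset_examples() output
examples = [
    {"text": "first example", "answer": "accept"},
    {"text": "second example", "answer": "reject"},
]

# Same serialization as the endpoint: one JSON object per line
output = StringIO()
for example in examples:
    output.write(json.dumps(example) + "\n")

jsonl = output.getvalue()
print(jsonl)

# JSONL round-trips line by line
restored = [json.loads(line) for line in jsonl.splitlines()]
assert restored == examples
```

The nice thing about JSONL is that each line is an independent JSON object, so the downloaded file can be fed straight back into `prodigy db-in` or any line-by-line consumer.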
Finally, you can find the built-in recipe source code in the `recipes` folder of your Prodigy installation path (run `python -m prodigy stats` to print that path).