db-out is giving an empty .jsonl file

Below is the command I use to start the Prodigy recipe for annotating page detection on my server:

/usr/bin/python3.9 -m prodigy bdrc-crop-images-recipe bdrc_crop '/usr/local/prodigy/prodigy-tools/data/page_cropping.csv' -F /usr/local/prodigy/prodigy-tools/recipes/bdrc_crop_images.py

where bdrc-crop-images-recipe is the name of the custom recipe and bdrc_crop is the name of the dataset.

Annotations are saved in /usr/local/prodigy/bdrc_crop_images.sqlite.

I ran /usr/bin/python3.9 -m prodigy db-out bdrc_crop > ./bdrc_crop_images.jsonl, both with and without sudo in front of it, but either way I get an empty bdrc_crop_images.jsonl file.

hi @ngawangtrinley!

Thanks for your question and welcome to the Prodigy community :wave:

For your dataset, can you run:

from prodigy.components.db import connect

db = connect()  # uses the database settings from your Prodigy config
examples = db.get_dataset_examples("bdrc_crop")
print(examples[0])  # or len(examples) to count the examples

Together with srsly.write_jsonl(), this is essentially what db-out does. If this returns no examples, then the problem isn't db-out, but that no data is being saved to the dataset.

If you find the problem isn't db-out, can you provide your custom recipe?
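
For reference, here is a minimal sketch of that export step. It is not Prodigy's actual db-out implementation: export_examples is a hypothetical helper, and it writes the lines with the standard json module instead of srsly.write_jsonl() so that it runs without extra dependencies. It assumes only that the object returned by connect() has the get_dataset_examples() method used in the snippet above.

```python
import json


def export_examples(db, dataset, out_path):
    """Write every example of `dataset` to `out_path` as JSONL; return the count.

    `export_examples` is a hypothetical helper, not part of Prodigy.
    `db` is expected to be what prodigy.components.db.connect() returns.
    """
    # An empty (or missing) result simply produces an empty file.
    examples = db.get_dataset_examples(dataset) or []
    with open(out_path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
    return len(examples)


# Usage (needs Prodigy installed and the same config the server uses):
# from prodigy.components.db import connect
# print(export_examples(connect(), "bdrc_crop", "bdrc_crop_images.jsonl"))
```

If the returned count is 0, the output file is empty for the same reason db-out's output is: the database that connect() opened has no examples under that dataset name.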

The image below shows the result of running what you recommended above: we get no examples.

The Python script below is our custom recipe:

import csv
import logging

import prodigy

from tools.config import PAGE_CROPPING_BUCKET, page_cropping_s3_client

# s3 config
s3_client = page_cropping_s3_client
bucket_name = PAGE_CROPPING_BUCKET


# log config 
logging.basicConfig(
    filename="/usr/local/prodigy/logs/bdrc_crop_images.log",
    format="%(levelname)s: %(message)s",
    level=logging.INFO,
    )

# Prodigy has a logger named "prodigy" according to 
# https://support.prodi.gy/t/how-to-write-log-data-to-file/1427/10
prodigy_logger = logging.getLogger('prodigy')
prodigy_logger.setLevel(logging.INFO)

@prodigy.recipe("bdrc-crop-images-recipe")
def bdrc_crop_images_recipe(dataset, csv_file):
    logging.info(f"dataset:{dataset}, csv_file_path:{csv_file}")
    obj_keys = []
    with open(csv_file) as _file:
        for csv_line in list(csv.reader(_file, delimiter=",")):
            s3_key = csv_line[0]
            # TODO: filter non-image files
            obj_keys.append(s3_key)
    return {
        "dataset": dataset,
        "stream": stream_from_s3(obj_keys),
        "view_id": "image_manual",
        "config": {
            "labels": ["PAGE"]
        }
    }


def stream_from_s3(obj_keys):
    for obj_key in obj_keys:
        image_url = s3_client.generate_presigned_url(
            ClientMethod="get_object",
            Params={"Bucket": bucket_name, "Key": obj_key},
            ExpiresIn=31536000
        )
        image_id = (obj_key.split("/"))[-1]
        yield {"id": image_id, "image": image_url}

Below is our configuration.json file:

{
  "theme": "basic",
  "custom_theme": { "cardMaxWidth": 2000 },
  "buttons": ["accept", "reject", "ignore", "undo"],
  "batch_size": 10,
  "history_size": 10,
  "port": 8090,
  "host": "localhost",
  "cors": true,
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "bdrc_crop_images.sqlite",
      "path": "/usr/local/prodigy"
    }
  },
  "validate": true,
  "auto_exclude_current": true,
  "instant_submit": true,
  "feed_overlap": false,
  "auto_count_stream": false,
  "total_examples_target": 0,
  "ui_lang": "en",
  "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
  "show_stats": false,
  "hide_meta": false,
  "show_flag": false,
  "instructions": false,
  "swipe": false,
  "swipe_gestures": { "left": "accept", "right": "reject" },
  "split_sents_threshold": false,
  "html_template": false,
  "global_css": null,
  "javascript": null,
  "writing_dir": "ltr",
  "show_whitespace": false,
  "exclude_by": "task"
}

Below is the .service file, which is at /etc/systemd/system/prodigy_bdrc_crop_images.service:

[Unit]
Description=Prodigy for images
After=syslog.target network.target

[Service]
Type=simple

SyslogIdentifier=prodigy_img
Environment=PRODIGY_HOME="/usr/local/prodigy"
Environment=PRODIGY_LOGGING=verbose
Environment=PRODIGY_CONFIG="/usr/local/prodigy/prodigy-tools/configuration/bdrc_crop_images.json"
WorkingDirectory=/usr/local/prodigy
ExecStart=/usr/bin/python3.9 -m prodigy bdrc-crop-images-recipe bdrc_crop '/usr/local/prodigy/prodigy-tools/data/page_cropping.csv' -F /usr/local/prodigy/prodigy-tools/recipes/bdrc_crop_images.py

User=prodigy
Group=prodigy

UMask=0007
RestartSec=10
Restart=always

[Install]
WantedBy=multi-user.target

hi @ngawangtrinley,

From your image, it looks like there are no annotated examples, so it's not db-out that's the problem.

Glancing at your recipe, I don't see any major problems. Since I don't have access to your S3 bucket, I can't reproduce the exact example, though.

Perhaps this is obvious, but can you confirm that you had annotations that were saved to the DB? Prodigy serves examples in batches (by default in batches of 10), so examples aren't saved into the DB until either you've completed a batch (and it fetches a new batch) or when you click save in the UI. Since you're logging, can you confirm that you've seen in the logs the annotations were saved to the DB?

Yes, there are lines like the following in the log (AWSAccessKeyId changed by me):

INFO: POST: /give_answers (received 1, session ID 'bdrc_crop-tenpa')
[{'id': 'I00KG023520048.tif_19.png', 'image': 'https://s3.amazonaws.com/image-processing.bdrc.io/Works/26/W00KG02331/images-web/W00KG02331-I00KG02352/I00KG023520048.tif_19.png?AWSAccessKeyId=XXX%3D&Expires=1711016507', '_input_hash': -346017609, '_task_hash': -1423144248, '_view_id': 'image_manual', 'width': 2000, 'height': 1538, 'spans': [{'id': '11fccc0c-8c99-4834-a9bf-519d263385d5', 'label': 'PAGE', 'color': 'yellow', 'x': 10.3, 'y': 17.7, 'height': 1498.8, 'width': 970.3000000000001, 'center': [495.45000000000005, 767.1], 'type': 'rect', 'points': [[10.3, 17.7], [10.3, 1516.5], [980.6, 1516.5], [980.6, 17.7]]}, {'id': 'd2cd1bf4-39bc-4054-83d9-f3b38f8370a5', 'label': 'PAGE', 'color': 'cyan', 'x': 983.7, 'y': 17.7, 'height': 1495.8, 'width': 970.3, 'center': [1468.85, 765.6], 'type': 'rect', 'points': [[983.7, 17.7], [983.7, 1513.5], [1954, 1513.5], [1954, 17.7]]}], 'answer': 'accept'}]

And if I inspect the SQLite file, I do get some data:

sudo -u prodigy sqlite3 bdrc_crop_images.sqlite
sqlite> SELECT COUNT(*) FROM example;
514
sqlite> SELECT COUNT(*) FROM dataset;
6
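
The two totals above don't show which dataset the 514 examples belong to. Here is a minimal sketch of a per-dataset count. The `example` and `dataset` tables appear in the queries above, but the `link` table and all column names are assumptions about the schema that I have not verified, so check them with `.schema` in sqlite3 first. examples_per_dataset is a hypothetical helper name.

```python
import sqlite3


def examples_per_dataset(conn):
    """Return {dataset name: number of linked examples}.

    ASSUMPTION: a `link` table with `dataset_id` and `example_id` columns
    ties rows in `example` to rows in `dataset`, and `dataset` has `id`
    and `name` columns. Verify with `.schema` before relying on this.
    """
    rows = conn.execute(
        "SELECT d.name, COUNT(l.example_id) "
        "FROM dataset AS d LEFT JOIN link AS l ON l.dataset_id = d.id "
        "GROUP BY d.id, d.name ORDER BY d.name"
    ).fetchall()
    return dict(rows)


# Usage, against the file from this thread:
# conn = sqlite3.connect("/usr/local/prodigy/bdrc_crop_images.sqlite")
# print(examples_per_dataset(conn))
```

If bdrc_crop shows a non-zero count here while db-out still writes nothing, the export command is probably reading a different database file than the one the server writes to.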

What else could we test?

Can you run prodigy stats <dataset-name>?

This should print something like this:

============================== ✨  Dataset Stats ==============================

Dataset       news_topics        
Created       2023-03-29 15:55:44
Description   None               
Author        None               
Annotations   13                 
Accept        13                 
Reject        0                  
Ignore        0  

The image below is what I got after running the recommended prodigy stats command.

After a bit of debugging, it turned out that we had to set the PRODIGY_CONFIG environment variable, so

sudo -u prodigy PRODIGY_CONFIG="/usr/local/prodigy/prodigy-tools/configuration/bdrc_crop_images.json" /usr/bin/python3.9 -m prodigy db-out bdrc_crop

works.
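
The likely explanation is that the systemd unit sets PRODIGY_CONFIG for the server, so annotations go to the SQLite file named in that config. A plain shell does not have the variable, so db-out presumably reads a different database. Here is a small sketch for checking which SQLite file a given config points to. sqlite_path_from_config is a hypothetical helper that only reads the JSON keys shown in our configuration file; it does not reproduce Prodigy's own config loading.

```python
import json
import os


def sqlite_path_from_config(config_path):
    """Return the SQLite file named in a Prodigy-style JSON config, or None.

    Only the "db" and "db_settings" keys shown in this thread are read;
    this is a diagnostic sketch, not Prodigy's config resolution.
    """
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    if config.get("db") != "sqlite":
        return None
    settings = config.get("db_settings", {}).get("sqlite", {})
    if "name" not in settings or "path" not in settings:
        return None
    return os.path.join(settings["path"], settings["name"])


# Usage: run in the same shell (and as the same user) as db-out.
# config_path = os.environ.get("PRODIGY_CONFIG")
# if config_path:
#     print(sqlite_path_from_config(config_path))
# else:
#     print("PRODIGY_CONFIG is not set in this environment")
```

With our configuration, this should resolve to /usr/local/prodigy/bdrc_crop_images.sqlite, the same file we inspected with sqlite3.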