Uploading a dataset of images

Hi,

I’m new to Prodigy. Is there any way to upload a dataset of images to the database through Python? I can see that it is possible to do it through the terminal, but I have not found any examples of uploading images through Python.

Thanks :slight_smile:

Hi and welcome! :slightly_smiling_face:

What do you mean by “uploading images”, and what are you trying to do? If you just want to annotate images, there’s no need to upload anything – you can directly pass in a directory of images as the source argument on the command line, and they’ll be streamed in for annotation. As you annotate them, the annotations will be stored in the database.

If you already have image annotations and want to import them to a Prodigy dataset, you can also do that. Just make sure that the examples are formatted in Prodigy’s JSON format. You can find examples of this in the “Annotation task formats” section of your PRODIGY_README.html. You can then either use the db-in command on the command line, or connect to the database in Python:

from prodigy.components.db import connect
db = connect()
db.add_dataset("new_dataset")  # create a new dataset
db.add_examples(some_new_examples, ["new_dataset"])  # add examples to the dataset

You can also find more details on the available database methods in the “DB” section in the Readme.
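For the db-in route, here is a minimal sketch of writing existing annotations to a JSONL file (one JSON object per line). The `write_jsonl` helper, the placeholder URLs and the label values are made up for illustration; the "image", "label" and "answer" keys are the ones discussed in this thread, and the README remains the authority on the exact task format.

```python
import json
from pathlib import Path


def write_jsonl(examples, out_path):
    """Write a list of dicts as JSONL: one JSON object per line."""
    with Path(out_path).open("w", encoding="utf-8") as f:
        for eg in examples:
            f.write(json.dumps(eg) + "\n")


# Hypothetical, already-annotated examples (placeholder URLs and labels).
examples = [
    {"image": "https://example.com/cat.jpg", "label": "CAT", "answer": "accept"},
    {"image": "https://example.com/dog.jpg", "label": "CAT", "answer": "reject"},
]
```

Calling `write_jsonl(examples, "annotations.jsonl")` and then running `prodigy db-in new_dataset annotations.jsonl` on the command line should import the file into the dataset.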

Thanks for the response!

In our case, the easiest thing would be to upload the images straight from Python. Even though I have read the documentation, I am still confused about how to convert pictures to the correct format for the upload.

The example code below, for instance, raises this error:

TypeError: string indices must be integers

Example code:

from prodigy.components.db import connect
import json
import base64
import os
import numpy as np

def extract_pic(path):
    with open(path, 'rb') as f:
        img = f.read()
    encrypted = base64.encodebytes(img).decode("utf-8")
    d = {'image': encrypted, 'label': str(np.random.randint(0, 2))}
    pic_json = json.dumps(d)
    return pic_json


def main(dataset_name):
    pic_dir = "../../Downloads/images"
    file_names = os.listdir(pic_dir)
    paths = map(lambda f: os.path.join(pic_dir, f), file_names)
    db = connect()
    data_json = list(map(lambda p: extract_pic(p), paths))
    db.add_dataset("ImageTest")  # create a new dataset
    db.add_examples(data_json, ["ImageTest"])  # add examples to the dataset

if __name__ == '__main__':
    main('python upload')

Your code looks fine to me and you definitely did the “hard part” correctly. Could you post the full traceback of that error and where it occurs? It sounds like something somewhere ended up being a string but should be a dict :thinking: (At least, that’s usually the classic cause of that error.)

Btw, one small thing: If the examples you’re adding are correct annotations, you should also add an "answer": "accept" to the data dict, so Prodigy knows that it’s a positive example.

Here is the traceback:

Traceback (most recent call last):
  File "/home/rosa/PycharmProjects/Example_recipe/upload.py", line 36, in <module>
    main('python upload')
  File "/home/rosa/PycharmProjects/Example_recipe/upload.py", line 31, in main
    db.add_examples(data_json, ["ImageTest"]) # add examples to the dataset
  File "/home/rosa/miniconda3/lib/python3.7/site-packages/prodigy/components/db.py", line 378, in add_examples
    input_hash=eg[INPUT_HASH_ATTR],
TypeError: string indices must be integers

Thanks! And I think I found the problem: before returning from extract_pic, you’re calling json.dumps on the dict, which turns it into a string. However, examples added to the database are expected to be a regular list of dictionaries, so add_examples fails when it tries to look up a key on the string. So you can drop that line and return the dict directly :slightly_smiling_face:
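To make the cause concrete, here is a small standalone sketch that reproduces the error without Prodigy. The `read_input_hash` function is a hypothetical stand-in for the `eg[INPUT_HASH_ATTR]` lookup in the traceback, and it assumes the attribute name is "_input_hash".

```python
import json

example = {"image": "<base64 data>", "label": "1"}
as_string = json.dumps(example)  # what extract_pic returned


def read_input_hash(eg):
    # Stand-in for the lookup in the traceback: eg[INPUT_HASH_ATTR]
    return eg["_input_hash"]


for eg in (as_string, example):
    try:
        read_input_hash(eg)
    except TypeError:
        # A str can only be indexed with integers, not with a key.
        print(type(eg).__name__, "-> TypeError")
    except KeyError:
        # A dict supports key lookup, but this one has no hash yet.
        print(type(eg).__name__, "-> KeyError")
```

The string version fails with the TypeError from the traceback, while the dict version gets past that point and only fails because the hash key is missing.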

I also just noticed that add_examples currently expects the examples to already have hashes set (an "_input_hash" representing the input data, e.g. the image, and a "_task_hash" representing the specific question, e.g. the image with a given label). You can either generate them yourself, or let Prodigy do it for you:

from prodigy import set_hashes
# in your code
pic_json = set_hashes(pic_json)
return pic_json
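Putting the suggestions from this thread together, the upload script could look roughly like the sketch below. It is untested against a real Prodigy install. The `image_to_example` and `upload_images` names are made up here, and embedding the image as a base64 data URI (rather than the raw base64 string) is an assumption about what the web app can render, so check the “Annotation task formats” section before relying on it.

```python
import base64
import mimetypes
import os


def image_to_example(path, label):
    """Build one example dict (not a JSON string) from an image file."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    # Assumption: embed the image as a data URI so a browser can display it.
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    return {
        "image": f"data:{mime};base64,{encoded}",
        "label": label,
        "answer": "accept",  # marks the example as a correct annotation
    }


def upload_images(pic_dir, dataset_name, label):
    """Hash every image example in pic_dir and add it to the dataset."""
    # Imported here so image_to_example works without Prodigy installed.
    from prodigy import set_hashes
    from prodigy.components.db import connect

    paths = [os.path.join(pic_dir, name) for name in sorted(os.listdir(pic_dir))]
    examples = [set_hashes(image_to_example(p, label)) for p in paths]
    db = connect()
    db.add_dataset(dataset_name)  # create a new dataset
    db.add_examples(examples, [dataset_name])  # add examples to the dataset
```

For example, `upload_images("../../Downloads/images", "ImageTest", "1")` would add every file in that directory under the label "1". Passing the label in explicitly replaces the random label from the original script.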

Thanks! That worked :smiley:
