How to export annotation of image manual without image string base64

irsal2009 · October 30, 2019, 10:58pm

I am new to Prodigy. I have already run image.manual and annotated the images with bounding boxes, then exported the annotations using db-out. However, the image key contains the base64 string of the image. What do I need to do to get annotations with just the image file name and no base64 string at all?
Thank you very much.

ines · October 31, 2019, 10:11am

Hi! What you get out at the end is always what you load in (which is kind of a core principle so you don't lose any data). If you don't want the images to be encoded, you can load in the image URLs from a JSONL file instead of the image directory.

I've posted an example of this in this thread the other day:

image.manual returns ValueError: Unmatched ''"' when when decoding 'string'

If your images are large and you can't easily change that, one solution would be to not rely on encoding the entire image as a string and instead load the images via URLs (and maybe keep an additional reference to the image ID so you can always relate the annotations back). Your input data could then be a JSONL file and you could specify --loader jsonl in image.manual. For example:
{"image": "https://example.com/image1.jpg", "id": 123}
{"image": "https://example.com/image2.jpg", "id": 456}
One thing to note: Using local file paths for the images isn't going to work, since modern browsers typically block those for security reasons (see here for details). So you'd either have to start a simple local server to host the directory of images, or upload them somewhere (like an S3 bucket).
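
As a rough sketch of how such a JSONL file could be produced: the helper below lists the image files in a directory and writes one record per file, joining each file name onto a base URL. The function name `write_image_jsonl`, the extension list and the base URL are assumptions for illustration and are not part of Prodigy.

```python
import json
from pathlib import Path
from urllib.parse import quote

# Assumed set of image extensions; adjust to match your files.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def write_image_jsonl(image_dir, base_url, out_path):
    """Write one {"image": url, "id": n} record per image file (hypothetical helper)."""
    files = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
    with open(out_path, "w", encoding="utf-8") as f:
        for i, path in enumerate(files, start=1):
            # URL-encode the file name so spaces and special characters survive
            url = base_url.rstrip("/") + "/" + quote(path.name)
            f.write(json.dumps({"image": url, "id": i}) + "\n")
    return len(files)
```

If the images are served locally, for instance by running `python -m http.server 8000` inside the image directory, the base URL would be `http://localhost:8000`. For an S3 bucket it would be the public URL prefix of the uploaded folder.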

irsal2009 · October 31, 2019, 10:26pm

Thanks Ines.
I uploaded my images to an AWS S3 bucket and created a JSONL file, aws-images.jsonl, containing something like this:
{"image":"https://...label.s3-us-west-1.amazonaws.com/images/1.jpg", "id":1}
{"image":"https://...label.s3-us-west-1.amazonaws.com/images/2.jpg", "id":2}

Then I ran:
prodigy image.manual my_dataset aws-images.jsonl --loader jsonl --label REFRIGERATOR,SINK,DISHWASHER,BATHTUB,SHOWER,TOILET,BED

I still have the image base64 string in the annotation output file. Is there something I did wrong?

Thanks

ines · November 1, 2019, 12:06pm

When you're exporting the data, are you looking at all the annotations in the dataset? The old annotations you've collected that include the image data will of course still be there – you've just changed the format of the new data you're reading in.

You can always convert the previous data and remove the "image" field and replace it with the filename (which should also be in the data) and then reupload it to a new dataset. Just make sure you keep the original files – if they change, you'll lose the reference to your annotations.
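
A minimal sketch of that conversion, working on a JSONL file exported with db-out rather than on the database directly. It assumes the file reference is stored either under a "path" key or under meta["file"], which may not match your data, so check one exported record first. The names `strip_base64` and `convert_file` are hypothetical helpers, not Prodigy functions.

```python
import json


def strip_base64(example, fallback_keys=("path",)):
    """Return a copy whose base64 "image" value is replaced by a file reference."""
    image = example.get("image")
    if not (isinstance(image, str) and image.startswith("data:")):
        return example  # already a URL or path, nothing to do
    cleaned = dict(example)
    # Assumption: the file reference lives under "path" or under meta["file"].
    reference = next((example[k] for k in fallback_keys if k in example), None)
    if reference is None:
        reference = (example.get("meta") or {}).get("file")
    if reference is None:
        del cleaned["image"]  # no reference found, so only drop the image data
    else:
        cleaned["image"] = reference
    return cleaned


def convert_file(in_path, out_path):
    """Rewrite an exported JSONL file line by line without the base64 data."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(json.dumps(strip_base64(json.loads(line))) + "\n")
```

The cleaned file can then be loaded into a fresh dataset with `prodigy db-in`, leaving the original dataset untouched as a backup.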

irsal2009 · November 1, 2019, 5:31pm

I created a new dataset and started from the beginning.

ines · November 1, 2019, 5:36pm

Ah, sorry – I forgot that the built-in implementation always fetches the images by default. We should probably have an option that just lets you toggle this on the command line. You can just remove the following line from the recipe function: stream = fetch_images(stream).

When you look at the function, you'll see that it's actually really small and straightforward, so you might also just want to write your own custom recipe.

irsal2009 · November 4, 2019, 6:40pm

Sorry, I still have a problem :). When I remove stream = fetch_images(stream), it throws an error, something like NameError: name 'stream' is not defined.

And if we remove 'stream' from return {....'stream':...} and run prodigy image.manual with the custom recipe, nothing happens; it doesn't show anything.

Thanks.

ines · November 4, 2019, 6:51pm

Yeah, you definitely shouldn't be removing the whole stream. The stream is the generator of examples that you're annotating. You can find more details on this in the documentation: https://prodi.gy/docs/worflow-custom-recipes

Maybe you removed too many lines? I'm not sure what you're editing, but for me, the recipe looks like this:

stream = get_stream(source, api=api, loader=loader, input_key="image")
stream = fetch_images(stream)

And I'm suggesting to remove the second line. If you do that, the variable stream will still be defined.

carolmanderson · January 30, 2020, 9:30pm

Hi, I have a similar question about how to prevent images from being stored as base64 strings in the database. I'm loading the images from a directory on the local machine. This is the recipe I'm using:

import prodigy
from prodigy.components.loaders import Images

def add_options(stream, labels):
    options = [{"id": label, "text": label.strip()} for label in labels.split(",")]
    for eg in stream:
        eg["options"] = options
        yield eg

@prodigy.recipe('image-choice')
def image_choice(dataset, source, labels):
    stream = Images(source)
    stream = add_options(stream, labels)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
        "config": {"choice_style": "multiple"},
        "feed_overlap": True,
    }

What should I change in order to stop the images themselves from being stored in the database?

ines · January 31, 2020, 11:26am

Hi! If you're using Prodigy v1.9.4+, the easiest way would be to use the ImageServer loader instead of Images. This will serve your image directory using the Prodigy web server so Prodigy can refer to them by a URL (and doesn't have to include the actual image data).

ines · June 17, 2020, 4:48pm

Just released Prodigy v1.10, which introduces a new before_db recipe callback that lets you remove any base64 data before the examples are placed in the database. The image.manual recipe now also has a --remove-base64 flag that takes care of this automatically.
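
For a custom recipe, a before_db callback might look roughly like the sketch below. It assumes each task carries the original file location under a "path" key; if a task has no such key, its image value is left alone. The callback would be returned from the recipe under the "before_db" key, alongside "dataset" and "stream".

```python
def before_db(examples):
    """Swap base64 image data for the file path before examples are saved."""
    for eg in examples:
        image = eg.get("image", "")
        # Assumption: "path" holds the original file location for this task.
        if isinstance(image, str) and image.startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples
```

Because the callback only runs on examples that are about to be saved, annotations already stored in a dataset are not changed by it.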

c00lcoder · July 4, 2022, 3:10pm

Did anyone figure out a way to do this? I imported using JSONL (since it was the preferred method). I've annotated hundreds of images and now when I export, I get a ton of base64 hash-like characters. I tried to reload using --remove-base64 but it still exports the same thing. Any guidance?

koaning · July 5, 2022, 6:37am

Hi Gerald.

Could you share the full command that you run when you annotate images? I just made a tutorial on the --remove-base64 setting on my machine and it worked just fine. Also just to check; can you confirm that you're running a recent version of Prodigy?

c00lcoder · July 5, 2022, 12:02pm

Hi Vincent,

This is the command that I ran: prodigy image.manual BGB-001 ./bgb.jsonl --loader jsonl --remove-base64 --label PERSON,FACE

Yes, it is the latest version of Prodigy. Can you confirm whether there is a way to remove the base64 after you've already done the annotations?

koaning · July 6, 2022, 9:32am

If you really wanted to you could pass your dataset to a Python script and remove the images from there. There's a Python API that might be very helpful here too.

That script would look something like:

from prodigy.components.db import connect

db = connect()                               # uses settings from prodigy.json
dataset = db.get_dataset("old_dataset")      # retrieve a dataset

new_examples = [remove_img(e) for e in dataset]
db.add_dataset("new_dataset")                   # add dataset
db.add_examples(new_examples, ["new_dataset"])  # add examples to dataset

In the meantime, I'll double check if there might be a bug related to the --remove-base64 flag.

koaning · July 6, 2022, 9:37am

Just to check, the --remove-base64 setting is merely there to prevent an image from being saved in the database. A consequence of this is that the --remove-base64 setting can only remove the images from what you're currently adding to the database.

Can you confirm if this flag fails to prevent new images from being saved? If so, is it possible for you to share one such image?

c00lcoder · July 7, 2022, 1:51am

Hey, when I added a new dataset and used the flag I can confirm that the --remove-base64 flag worked as expected. I'll check out your script. I annotated over 500 images but the jsonl file is huge and hard to read...hoping not to lose that work and get a usable file.

koaning · July 7, 2022, 8:57am

Ah yeah, then the Python script is the way to go for now. If you hit any issues there, let me know.

c00lcoder · July 13, 2022, 1:24pm

Hey, I keep getting an error when I try out the script to remove the base64. At first it said remove_img is not defined, so I simply did remove_img = ().

Now it is saying:

  File "Prod_db_script.py", line 8, in <listcomp>
    new_examples = [remove_img(e) for e in dataset]
TypeError: 'tuple' object is not callable

Any other thoughts on how I can get around this to remove the base64 from the dataset?

Here is the entire script

from prodigy.components.db import connect

db = connect()

dataset = db.get_dataset("BGB-001")

remove_img = ()
new_examples = [remove_img(e) for e in dataset]
db.add_dataset("BGB-001v2")
db.add_examples(new_examples, ["BGB-001v2"])

koaning · July 13, 2022, 1:55pm

When I shared the code earlier I annotated it by saying that the script would look "something like" the snippet below. You'd still need to implement a function called remove_img. This function, if memory serves, merely needs to delete the key that contains the image though. So you could implement it via:

def remove_img(d):
    # Drop the "image" key, which holds the base64 data, and keep everything else
    return {k: v for k, v in d.items() if k != 'image'}
