How to export image.manual annotations without the base64 image string

I'm new to Prodigy. I ran image.manual and annotated the images with bounding boxes, then exported the annotations using db-out. However, the "image" key in the output contains the base64 string of the image. What do I need to do to get annotations with just the image file name, without any base64 string at all?
Thank you very much.

Hi! What you get out at the end is always what you load in (which is kind of a core principle so you don't lose any data). If you don't want the images to be encoded, you can load in the image URLs from a JSONL file instead of the image directory.
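
For illustration, here is a minimal sketch of how such a JSONL file could be generated. The function names, the example.com URLs and the "id" field are assumptions made up for this example; what matters is that each line is a JSON object whose "image" key holds a URL rather than encoded image data.

```python
import json


def image_tasks_to_jsonl(urls):
    """Return JSONL text with one {"image": url, "id": n} task per URL."""
    lines = []
    for i, url in enumerate(urls, start=1):
        # One JSON object per line; "image" holds a URL, not base64 data.
        lines.append(json.dumps({"image": url, "id": i}) + "\n")
    return "".join(lines)


def write_image_jsonl(urls, out_path):
    """Write the tasks to a file that can be loaded with --loader jsonl."""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(image_tasks_to_jsonl(urls))


# Hypothetical usage:
# write_image_jsonl(["https://example.com/images/1.jpg"], "images.jsonl")
```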

I've posted an example of this in this thread the other day:

Thanks Ines.
I uploaded my images to an AWS S3 bucket and created a JSONL file, aws-images.jsonl, containing something like this:
{"image":"https://...label.s3-us-west-1.amazonaws.com/images/1.jpg", "id":1}
{"image":"https://...label.s3-us-west-1.amazonaws.com/images/2.jpg", "id":2}

And then I ran:
prodigy image.manual my_dataset aws-images.jsonl --loader jsonl --label REFRIGERATOR,SINK,DISHWASHER,BATHTUB,SHOWER,TOILET,BED

I still have the base64 image string in the annotation output file. Is there something I did wrong?

Thanks

When you're exporting the data, are you looking at all the annotations in the dataset? The old annotations you've collected that include the image data will of course still be there – you've just changed the format of the new data you're reading in.

You can always convert the previous data and remove the "image" field and replace it with the filename (which should also be in the data) and then reupload it to a new dataset. Just make sure you keep the original files – if they change, you'll lose the reference to your annotations.

I created a new dataset and started from the beginning.

Ah, sorry – I forgot that the built-in implementation always fetches the images by default. We should probably have an option that just lets you toggle this on the command line. You can just remove the following line from the recipe function: stream = fetch_images(stream).

When you look at the function, you'll see that it's actually really small and straightforward, so you might also just want to write your own custom recipe.

Sorry, I still have a problem :). When I remove stream = fetch_images(stream), it throws an error, something like NameError: name 'stream' is not defined.

And if I remove 'stream' from return {....'stream':...} and run prodigy image.manual with the custom recipe, nothing happens; it doesn't show anything.

Thanks.

Yeah, you definitely shouldn't be removing the whole stream. The stream is the generator of examples that you're annotating. You can find more details on this in the documentation: https://prodi.gy/docs/workflow-custom-recipes

Maybe you removed too many lines? I'm not sure what you're editing, but for me, the recipe looks like this:

stream = get_stream(source, api=api, loader=loader, input_key="image")
stream = fetch_images(stream)

And I'm suggesting to remove the second line. If you do that, the variable stream will still be defined.

Hi, I have a similar question about how to prevent images from being stored as base64 strings in the database. I'm loading the images from a directory on the local machine. This is the recipe I'm using:

import prodigy
from prodigy.components.loaders import Images

def add_options(stream, labels):
    options = [{"id": label, "text": label.strip()} for label in labels.split(",")]
    for eg in stream:
        eg["options"] = options
        yield eg

@prodigy.recipe("image-choice")
def image_choice(dataset, source, labels):
    stream = Images(source)
    stream = add_options(stream, labels)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
        "config": {"choice_style": "multiple"},
        "feed_overlap": True,
    }

What should I change in order to stop the images themselves from being stored in the database?

Hi! If you're using Prodigy v1.9.4+, the easiest way would be to use the ImageServer loader instead of Images. This will serve your image directory using the Prodigy web server so Prodigy can refer to them by a URL (and doesn't have to include the actual image data).

Just released Prodigy v1.10, which introduces a new before_db recipe callback that lets you remove any base64 data before the examples are placed in the database. The image.manual recipe now also has a --remove-base64 flag that takes care of this automatically.
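
As a rough sketch, a before_db callback for a custom recipe could look something like the following. It assumes each task still carries a reference to the original file under a "path" key, which is an assumption about your data and worth checking against one of your own examples. Tasks without that key are left untouched rather than losing their only image reference.

```python
def before_db(examples):
    """Swap inline base64 image data for the original file reference."""
    for eg in examples:
        image = eg.get("image")
        # Only touch tasks whose image is an inline data URI.
        # "path" is an assumed key; adjust it to whatever your tasks contain.
        if isinstance(image, str) and image.startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples


# In a custom recipe, the callback is returned alongside the other components:
# return {"dataset": dataset, "stream": stream, "view_id": "image_manual",
#         "before_db": before_db}
```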

Did anyone figure out a way to do this? I imported using JSONL (since it was the preferred method). I've annotated hundreds of images, and now when I export, I get a ton of base64-like characters. I tried to reload using --remove-base64, but it still exports the same thing. Any guidance?

Hi Gerald.

Could you share the full command that you run when you annotate images? I just made a tutorial on the --remove-base64 setting on my machine and it worked just fine. Also just to check; can you confirm that you're running a recent version of Prodigy?

Hi Vincent,

This is the command that I ran:

prodigy image.manual BGB-001 ./bgb.jsonl --loader jsonl --remove-base64 --label PERSON,FACE

Yes, it is the latest version of Prodigy. Can you confirm whether there is a way to remove the base64 data after you've already done the annotations?

If you really wanted to you could pass your dataset to a Python script and remove the images from there. There's a Python API that might be very helpful here too.

That script would look something like:

from prodigy.components.db import connect

db = connect()                               # uses settings from prodigy.json
dataset = db.get_dataset("old_dataset")      # retrieve a dataset

new_examples = [remove_img(e) for e in dataset]
db.add_dataset("new_dataset")                   # add dataset
db.add_examples(new_examples, ["new_dataset"])  # add examples to dataset

In the meantime, I'll double check if there might be a bug related to the --remove-base64 flag.

Just to check, the --remove-base64 setting is merely there to prevent an image from being saved in the database. A consequence of this is that the --remove-base64 setting can only remove the images from what you're currently adding to the database.

Can you confirm if this flag fails to prevent new images from being saved? If so, is it possible for you to share one such image?

Hey, when I added a new dataset and used the flag I can confirm that the --remove-base64 flag worked as expected. I'll check out your script. I annotated over 500 images but the jsonl file is huge and hard to read...hoping not to lose that work and get a usable file.

Ah yeah, then the Python script is the way to go for now. If you hit any issues there, let me know :)

Hey I keep getting an error when I try out the script to remove the base64. At first it said remove_img is not defined - so I simply did a remove_img = ().

Now it is saying:

  File "Prod_db_script.py", line 8, in <listcomp>
    new_examples = [remove_img(e) for e in dataset]
TypeError: 'tuple' object is not callable

Any other thoughts on how I can get around this to remove the base64 from the dataset?

Here is the entire script

from prodigy.components.db import connect

db = connect()

dataset = db.get_dataset("BGB-001")

remove_img = ()
new_examples = [remove_img(e) for e in dataset]
db.add_dataset("BGB-001v2")
db.add_examples(new_examples, ["BGB-001v2"])

When I shared the code earlier I annotated it by saying that the script would look "something like" the snippet below. You'd still need to implement a function called remove_img. This function, if memory serves, merely needs to delete the key that contains the image though. So you could implement it via:

def remove_img(d):
    # Return a copy of the example without the "image" key.
    return {k: v for k, v in d.items() if k != 'image'}
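
Putting the two snippets together, the whole script might look something like this. It is only a sketch: copy_without_images is a hypothetical helper name, the dataset names are the ones from this thread, and the database method names are those used in the snippet above, so they may differ in other Prodigy versions. Note that dropping "image" entirely only makes sense if the examples keep some other reference to the file.

```python
def remove_img(example):
    """Return a copy of the task without its "image" key."""
    return {k: v for k, v in example.items() if k != "image"}


def copy_without_images(old_name, new_name):
    """Copy all examples from old_name into new_name, minus the image data."""
    # Imported here so remove_img can be tried out without touching the database.
    from prodigy.components.db import connect

    db = connect()  # uses settings from prodigy.json
    new_examples = [remove_img(eg) for eg in db.get_dataset(old_name)]
    db.add_dataset(new_name)
    db.add_examples(new_examples, [new_name])
    return len(new_examples)


# Example call, using the dataset names from this thread:
# copy_without_images("BGB-001", "BGB-001v2")
```

The original dataset is left untouched, so you can compare a db-out export of both before deleting anything.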