How to update image annotations if I need to bulk rename filenames of the underlying images?

I have annotated a bunch of images in a folder (with image.manual), and exported (db-out) the corresponding annotations in a jsonl file.

Now, for some reason I had to change the filenames of the original files, and I would like to "migrate" the annotations so that they correspond to the renamed files, and then annotate some more.

A trivial substitution of the new filenames in the text and/or meta fields in the jsonl file doesn't seem to do the trick, possibly because I would need to recompute _input_hash as well. Is there a way to do this?

Thanks!

Just to make sure I understand the question correctly: The main problem in your case is that you're now being presented with the same examples for annotation again, since they're considered different (due to the new naming)?

In that case rehashing the existing annotations after renaming might be the easiest solution – you can use Prodigy's set_hashes helper for that and force overwriting:

examples = [prodigy.set_hashes(eg, overwrite=True) for eg in examples]

I'm trying to think of a more generic solution we could offer for this, but it's tricky: the general assumptions the hashing/exclude logic makes are reasonable, and in pretty much all other cases it should treat a different filename as "new information".
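To illustrate why a plain find-and-replace of the filenames isn't enough, here is a toy sketch of content-based hashing. It is not Prodigy's actual hashing algorithm: toy_input_hash, the choice of SHA-1 and the "text"/"image" field names are all assumptions made for this example. The point is only that a hash derived from the input fields changes when those fields change, so a hash stored before the rename no longer matches the edited record.

```python
import hashlib
import json


def toy_input_hash(example, input_keys=("text", "image")):
    """Hash only the input-defining fields of a task.

    Toy stand-in for illustration; NOT how Prodigy computes _input_hash.
    """
    relevant = {key: example[key] for key in input_keys if key in example}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


# Hypothetical task dicts before and after the rename
old = {"text": "img_001", "image": "images/img_001.jpg"}
renamed = {"text": "photo_001", "image": "images/photo_001.jpg"}

# The task content differs, so the two hashes differ
print(toy_input_hash(old) == toy_input_hash(renamed))

# A hash stored before the rename no longer matches the renamed record
stale = dict(renamed, _input_hash=toy_input_hash(old))
print(stale["_input_hash"] == toy_input_hash(renamed))
```

In this sketch, fields outside input_keys (for example the answer) don't affect the hash, which is why only a change to the input itself makes an example look "new".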

Correct @ines, that's indeed the behavior I am seeing!
To reproduce: annotate a bunch of images, export the annotations to a file annotations_with_old_naming.jsonl, rename the underlying images (and replace the filenames in the JSONL file accordingly), then load the annotations into a new_dataset with db-in. If you then start annotating new_dataset with image.manual, the already-annotated examples are presented again in the Prodigy UI.

Following your advice, I tried

import json

import prodigy
from prodigy.components.loaders import JSONL

# Before running this, I edit the JSONL file so that it contains the new filenames.
jsonl_stream = JSONL("annotations_with_old_naming.jsonl")

# Recompute the hashes from the updated content, overwriting the stale ones.
examples = [prodigy.set_hashes(eg, overwrite=True) for eg in jsonl_stream]

# Write one JSON object per line.
with open("annotations_rehashed.jsonl", "w", encoding="utf-8") as f:
    for eg in examples:
        f.write(json.dumps(eg) + "\n")

That seems to have done the trick: after running

python -m prodigy db-in new_dataset annotations_rehashed.jsonl

and restarting annotation of new_dataset with image.manual, the already-annotated images no longer show up.
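For anyone who would rather script the step of replacing the filenames in the JSONL file than edit it by hand, here is a minimal sketch using only the standard library. The helper names (rename_in_record, rename_jsonl), the old-to-new mapping, and the assumption that the filename sits in the "text" field and in a "file" key under "meta" are all hypothetical, so check them against your own exported records first.

```python
import json


def _swap(value, mapping):
    """Return the new name if value is a known old filename, else value unchanged."""
    return mapping.get(value, value) if isinstance(value, str) else value


def rename_in_record(record, mapping, keys=("text",), meta_keys=("file",)):
    """Return a copy of one annotation record with old filenames replaced.

    The key names are assumptions; adjust them to match your data.
    Stored hashes are left untouched and must be recomputed afterwards.
    """
    updated = dict(record)
    for key in keys:
        if key in updated:
            updated[key] = _swap(updated[key], mapping)
    if isinstance(updated.get("meta"), dict):
        meta = dict(updated["meta"])
        for key in meta_keys:
            if key in meta:
                meta[key] = _swap(meta[key], mapping)
        updated["meta"] = meta
    return updated


def rename_jsonl(src_path, dst_path, mapping):
    """Apply rename_in_record to every non-empty line of a JSONL file."""
    with open(src_path, encoding="utf-8") as fin, \
            open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if line.strip():
                record = rename_in_record(json.loads(line), mapping)
                fout.write(json.dumps(record) + "\n")


# Hypothetical example record and mapping
mapping = {"img_001.jpg": "photo_001.jpg"}
record = {"text": "img_001.jpg", "meta": {"file": "img_001.jpg"}, "answer": "accept"}
print(rename_in_record(record, mapping))
```

The output of rename_jsonl still carries the old hashes, so it would need to go through the set_hashes step with overwrite=True before db-in.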

Cool! I think I also have an idea for a more generic solution to make this easier: if we add a --rehash flag to the db-merge command, you could use that to copy annotations from one or more datasets and also assign new hashes in the process. This would also be a useful feature in other situations.

Just released Prodigy v1.10, which adds the --rehash flag to the db-merge command!