duplicate images when annotating

Hi!

I have a problem with duplicate images when annotating: I can receive the same image twice during an annotation session.

An example (you can see that I already annotated this image before in left history bar):

And after annotation ends, Prodigy reports fewer unique pictures than were annotated:

The command which is used for annotation:
prodigy image.manual michaelkors_checked ./michaelkors.jsonl --loader jsonl --label MICHAELKORS1,MICHAELKORS2

I use the latest version of Prodigy (1.10.3).

How can I solve my problem?

Hi @DanilKonon,

Sorry to hear you're seeing duplicates. I tried to reproduce the problem you describe using 1.10.3 and your command line example, but I couldn't get duplicates to show up.

Can you tell me more about your configuration to help figure out what's gone wrong? What values (if any) do you set in your prodigy.json file? Since you're loading the images from a JSONL file, can you also give me an example of one or two entries from your file, so I can make sure mine look the same?
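For reference, here's a minimal sketch of the kind of entries image.manual expects: one JSON object per line, with an "image" key holding a file path, URL, or base64 data URI. The file names below are made-up placeholders, not taken from your dataset:

```python
import json

# Hypothetical JSONL entries for image.manual: one JSON object per
# line; the "image" value can be a path, a URL, or a data URI.
examples = [
    {"image": "images/bag_001.jpg"},
    {"image": "https://example.com/images/bag_002.jpg"},
]

for eg in examples:
    print(json.dumps(eg))
```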

Thanks,
-Justin


Hi @justindujardin

Thank you for your answer!

Here is prodigy.json:

{
  "db": "mysql",
    "db_settings": {
      "mysql": {
        "host": "host",
        "user": "user",
        "passwd": "pass",
        "db": "prodigy"
      }
    },
  "port": 8182,
  "mysql_max_len": 16777215,
  "feed_overlap": true,
  "host": "host",
  "swipe": true,
  "show_stats": true,
  "show_flag": true,
  "validate": false,
  "custom_theme": {
    "buttonSize": 50
  },
  "image_manual_stroke_width": 10
}

And I attach the first 100 annotations from the JSONL file. I exported them with db-out.


@DanilKonon thanks for the extra information! I used the annotations and prodigy.json configuration you linked, but I still can't reproduce the duplicates.

I actually think I need a few lines from ./michaelkors.jsonl instead of the db-out annotations to try reproducing your problem. Like you pointed out, when the server receives duplicates it only stores one of them in the database, so exporting the annotations with db-out doesn't include the duplicate that caused the problem. Can you attach the first 100 lines of ./michaelkors.jsonl, even if it only has file paths instead of image data?

Since it's not been easy to reproduce the problem, can I ask about the latency to your MySQL database? It can be hard to estimate, so I wrote a small recipe that will report on it. If you could run it and let me know the results, it would be helpful:

timing.py

import time

from prodigy.components.db import connect
from prodigy.core import recipe
from prodigy.util import color, get_timestamp_session_id, msg, set_hashes


@recipe("timing")
def timing():
    """Report on the observed latency of databse operations"""

    def to_elapsed(time1, time2) -> str:
        delta = time2 - time1
        text = "{:.3f}s".format(delta)
        if delta < 1.0:
            colr = "green"
        elif delta < 3.0:
            colr = "yellow"
        else:
            colr = "red"
        return color(text, colr)

    DB = connect()
    num = 250
    annotations = [
        {"text": f"Text {i}", "answer": "accept", "label": "TEST_LABEL"}
        for i in range(num)
    ]
    annotations = [set_hashes(eg) for eg in annotations]
    set_id = f"health-recipe-{get_timestamp_session_id()}"
    if set_id in DB:
        DB.drop_dataset(set_id)

    start = time.time()
    DB.add_dataset(set_id)
    stop = time.time()
    t_add_dataset = to_elapsed(start, stop)

    start = time.time()
    DB.add_examples(annotations, datasets=[set_id])
    stop = time.time()
    t_add_examples = to_elapsed(start, stop)

    start = time.time()
    DB.drop_dataset(set_id, batch_size=100)
    stop = time.time()
    t_drop_dataset = to_elapsed(start, stop)

    msg.divider("Database Timing")
    msg.table(
        [
            ["Create a Dataset", t_add_dataset],
            [f"Add {num} Examples", t_add_examples],
            ["Drop the Dataset", t_drop_dataset],
        ]
    )

You can run it like:

prodigy timing -F timing.py 

It should print some results like:

============================== Database Timing ==============================

Create a Dataset   0.004s
Add 250 Examples   0.120s
Drop the Dataset   0.117s

@justindujardin sorry for the late reply

here is my db timing:

============================== Database Timing ==============================

Create a Dataset   0.597s
Add 250 Examples   0.707s
Drop the Dataset   0.314s 

Actually, the link I attached is a file with the first hundred annotations from ./michaelkors.jsonl. I just wanted to tell you that the ./michaelkors.jsonl file itself was produced with db-out.

Thanks for clarifying about the JSONL file and sharing your DB timings; that was helpful!

I was able to reproduce the duplicates by holding down the shortcut key to accept images as quickly as possible until all 100 were done. :sweat_smile: When I did that, Prodigy would show me 101 or 102 images, but only store the expected 100 in the database.

I think the trouble is that when you answer questions at the same time as the app asks for more, it can miss reporting one or two items that you've answered but not yet sent back to the server, which causes the server to send them again.

I came up with a fix that filters received duplicates from the server. It fixes the duplicates in all my tests. Would you be willing to try a preview build with my fix? If so, please send me an email at justin@explosion.ai.
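For readers following along, this kind of fix (dropping tasks the server re-sends) can be sketched as a hash-based filter over the incoming stream. This is an illustration only, not Prodigy's actual implementation; it assumes each task dict carries the "_task_hash" key that Prodigy's set_hashes() adds:

```python
def filter_duplicates(stream):
    """Yield only tasks whose task hash hasn't been seen before.

    Assumes each task dict carries a "_task_hash" key, as Prodigy
    tasks do after set_hashes() has been applied.
    """
    seen = set()
    for task in stream:
        h = task.get("_task_hash")
        if h is not None and h in seen:
            continue  # drop the duplicate the server re-sent
        seen.add(h)
        yield task


tasks = [
    {"_task_hash": 1, "image": "a.jpg"},
    {"_task_hash": 2, "image": "b.jpg"},
    {"_task_hash": 1, "image": "a.jpg"},  # re-sent duplicate
]
print(len(list(filter_duplicates(tasks))))  # → 2
```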

Thanks,
-Justin

Update: I got confirmation via email that the proposed fix works.


Just released Prodigy v1.10.4, which includes the fix! :slightly_smiling_face:
