Items or Task repetition problem

So what we do is: we have a CSV file with a list of S3 image keys in it. We give the CSV file path to the custom recipe, which reads the file, goes through the image keys, and streams them to Prodigy to be annotated. I have put feed_overlap: false in the configuration.json. When the annotations are done and I use db-out to get the JSONL file from the SQLite DB, there are 2400 annotated images in the JSONL. That means about 1000 images are repeated, i.e. annotated more than once. When I filter the JSONL to keep each image only once, ignoring the repeated annotations, I get 1470 images that aren't repeating. Why is this happening? Is this due to the use of SQLite as the DB? For the latest batch to be annotated I have now started using annotations_per_task: 1 in the configuration.json and removed the feed_overlap key from it. Will that help in giving each task to only a single annotator?

It's hard to know for sure without knowing the details of your setup.

Could you share your custom recipe? It could be that the hashing isn't set up appropriately.

Do you have many annotators? If so, do they use their own session name consistently? Do you have verbose logs available where you can confirm that tasks are assigned to multiple users?

In general: SQLite should not be the cause of this. The routing and assignment of tasks is handled by Python code; the database is just there to store the annotations.

Let me know though! If there is a bug on our side I want to do a deep dive.

We have 5 annotators using the same instance, but each with their own session ID, e.g.:
pecha.com/line_to_text/?session=tashi

We aren't using any hashing or task routing in this custom recipe for now.

import logging
import prodigy
from tools.config import s3_client, bucket_name
import jsonlines

@prodigy.recipe("line-to-text-recipe")
def line_to_text_recipe(dataset, jsonl_file):
    logging.info(f"dataset:{dataset}, jsonl_file_path:{jsonl_file}")
    blocks = [ 
        {"view_id": "image"},
        {"view_id": "text_input"}
    ]
    return {
        "dataset": dataset,
        "stream": stream_from_jsonl(jsonl_file),
        "view_id": "blocks",
        "config": {
            "blocks": blocks,
            "editable": True
        }
    }


def stream_from_jsonl(jsonl_file):
    with jsonlines.open(jsonl_file) as reader:
        for line in reader:
            image_id = line["id"]
            obj_key = line["image_url"]
            text = line["user_input"]
            image_url = get_new_url(obj_key)
            yield {"id": image_id, "image": image_url, "user_input": text}

def get_new_url(image_url):
    new_image_url = s3_client.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": bucket_name, "Key": image_url},
        ExpiresIn=31536000
    )
    return new_image_url

And our JSONL is below:

{"id": "1-1-1a_line_9874_0", "image_url": "line_images/1-1-1a_line_9874_0.jpg", "user_input": "༄༅། །འདུལ་བ་ཀ་པ་བཞུགས་སོ། །"}
{"id": "1-1-2b_line_9874_0", "image_url": "line_images/1-1-2b_line_9874_0.jpg", "user_input": "གསོ་སྦྱོང་གཞི་དང་ནི། །དགག་དབྱེ་དབྱར་དང་ཀོ་ལྤགས་གཞི། །སྨན་དང་གོས་དང་སྲ་བརྐྱང་དང༌། །ཀོའུ་ཤམ་བི་དང་ལས་ཀྱི་གཞི།།"}
{"id": "1-1-2b_line_9874_1", "image_url": "line_images/1-1-2b_line_9874_1.jpg", "user_input": "དམར་སེར་ཅན་དང་གང་ཟག་དང༌། །སྤོ་དང་གསོ་སྦྱོང་གཞག་པ་དང༌། །གནས་མལ་དང་ནི་རྩོད་པ་དང༌། །དགེ་འདུན་དབྱེན་"}

When I look at this function, I notice that the hashes aren't set anywhere. It also makes me wonder: don't you see a warning appear when you start the recipe, complaining about the lack of hashes?

I think this quick fix ought to patch it, though:

import jsonlines

from prodigy import set_hashes

def stream_from_jsonl(jsonl_file):
    with jsonlines.open(jsonl_file) as reader:
        for line in reader:
            image_id = line["id"]
            obj_key = line["image_url"]
            text = line["user_input"]
            image_url = get_new_url(obj_key)
            example = {"id": image_id, "image": image_url, "user_input": text}
            # yield (not return) so every example in the file is streamed, with hashes attached
            yield set_hashes(example)

This function uses the set_hashes function, which sets properties that Prodigy can use to figure out whether an example has been annotated before. By default it will pick up the image key and use it to construct the appropriate _input_hash, and this hash can then be used to identify unique examples.

More information can be seen in the docs here:

Just to check though: do you want the user_input key to also be used in the hash? It might make sense to add it as a key for the task hash by setting task_keys=("user_input",) in the set_hashes function, but that will mainly make sense if you'd like that key to help mark an example as "unique".
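To make the difference concrete, here's a minimal sketch (the example dict and the placeholder URL are made up, but they mirror the shape of the examples your recipe yields):

from prodigy import set_hashes

# hypothetical example, shaped like the ones stream_from_jsonl yields
eg = {
    "id": "1-1-1a_line_9874_0",
    "image": "https://<presigned-s3-url>",  # placeholder URL
    "user_input": "༄༅། །འདུལ་བ་ཀ་པ་བཞུགས་སོ། །",
}

# default behaviour: the "image" value feeds the _input_hash
hashed = set_hashes(dict(eg))
print(hashed["_input_hash"], hashed["_task_hash"])

# optionally let user_input contribute to the _task_hash as well
hashed_with_text = set_hashes(dict(eg), task_keys=("user_input",))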

More information on hashing in Prodigy:

So we have been using roughly the same recipe for many instances. We had never used set_hashes() in those recipes, and we never saw any warnings, either in the UI or in the log file.
But I have now started using set_hashes() as you advised, and we will let you know if it helps in curbing the repetitions.


hi @koaning
So after using set_hashes, the tasks are still repeating. I put 3000 items in the JSONL, but after the annotation is done, when I db-out the annotations I get 3005 annotations in my output JSONL. The repetition issue is a big headache for us, because we have to pay the annotators per annotation and we have no use for repeated annotations.

Just to be clear: in this example you've got a .jsonl file with 3000 examples in it and you annotate it yourself manually? When you do this, it seems that there are 5 duplicates? Or did you not annotate all the examples to reach the 3005 examples that come out of db-out? Or are multiple annotators involved here as well? Could this be due to work-stealing?

I'd be very eager to help you unravel what might be going wrong here, but it would help to have a bit more information on what is happening. Are you running your system with logging turned on? If the logging is set to verbose mode you should be able to see lines appear whenever the task router sends tasks to annotators. Is it possible for you to share some of these logs?

Can you also share the output of prodigy stats just so I have all of your version information? I'd also be interested in seeing your prodigy.json file if you have extra settings set up.

Finally, another alternative could be that I try to replicate your setup locally. So if you are able to confirm that this behavior also occurs with a minimal jsonl examples file I'd also be all ears.

So, I had the same problem with another instance, where I put 1774 S3 image keys in a .csv file, and the custom recipe goes through the csv file and then streams the images for annotation. When the annotators were done annotating, I db-out the output into a jsonl file; there are 1826 items annotated, so that means 52 images were repeated or duplicated.

To answer your question: we have 4 annotators, who annotate line segmentation on those page images.
And no, it is not due to work-stealing, because we have put feed_overlap: false.

Here is the link to the db-out output jsonl file after the annotation: output jsonl

Below is the custom recipe that we used, which produced 5 repetitions for 3000 items and 52 repetitions for 1774 items. I have used set_hashes in the recipe as advised last time. It has definitely reduced the number of duplicates, but there is still a duplication problem.

import csv
import json
import logging

import prodigy
from prodigy import set_hashes

from tools.config import s3_client, bucket_name


@prodigy.recipe("line-segmentation-recipe")
def line_segmentation_recipe(dataset, csv_file):
    logging.info(f"dataset:{dataset}, csv_file_path:{csv_file}")
    obj_keys = []
    with open(csv_file) as _file:
        for csv_line in list(csv.reader(_file, delimiter=",")):
            s3_key = csv_line[0]
            obj_keys.append(s3_key)
    return {
        "dataset": dataset,
        "stream": stream_from_s3(obj_keys),
        "view_id": "image_manual",
        "config": {
            "labels": ["Line"]
        }
    }


def stream_from_s3(obj_keys):
    for obj_key in obj_keys:
        # presigned URL so the browser can load the image straight from S3
        image_url = s3_client.generate_presigned_url(
            ClientMethod="get_object",
            Params={"Bucket": bucket_name, "Key": obj_key},
            ExpiresIn=31536000
        )
        image_id = (obj_key.split("/"))[-1]
        eg = {"id": image_id, "image": image_url}
        # note the trailing comma: ("id",) is a tuple, ("id") is just the string "id"
        yield set_hashes(eg, input_keys=("id",))

Below is an example of the S3 image keys in the csv file. We give the csv file path when we call the custom recipe, along with the name of the dataset.

Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62377/I4CZ623770521.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62377/I4CZ623770545.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62378/I4CZ623780037.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62378/I4CZ623780382.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62378/I4CZ623780437.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62378/I4CZ623780548.jpg_2000x700.jpg
Works/e3/W4CZ62374/images-web/W4CZ62374-I4CZ62379/I4CZ623790019.jpg_2000x700.jpg

Below is the prodigy.json, i.e. the configuration file, that we used for this:

{
    "theme": "basic",
    "custom_theme": { "cardMaxWidth": 2000 },
    "buttons": ["accept", "reject", "ignore", "undo"],
    "batch_size": 5,
    "history_size": 10,
    "port": 8060,
    "host": "localhost",
    "cors": true,
    "db": "sqlite",
    "db_settings": {
      "sqlite": {
        "name": "line_segmentation.sqlite",
        "path": "/usr/local/prodigy"
      }
    },
    "validate": true,
    "image_manual_stroke_width": 2,
    "image_manual_font_size": 12,
    "feed_overlap": false,
    "auto_exclude_current": true,
    "instant_submit": true,
    "auto_count_stream": true,
    "total_examples_target": 0,
    "ui_lang": "en",
    "project_info": ["dataset", "session", "lang", "recipe_name", "view_id", "label"],
    "show_stats": false,
    "hide_meta": false,
    "show_flag": false,
    "instructions": false,
    "swipe": false,
    "swipe_gestures": { "left": "accept", "right": "reject" },
    "split_sents_threshold": false,
    "html_template": false,
    "global_css": null,
    "javascript": null,
    "writing_dir": "ltr",
    "show_whitespace": false,
    "exclude_by": "input"
  }

The image below shows the prodigy stats of the Prodigy install we are using; we are on the latest version, 1.12.4.

The journalctl of the instance, i.e. the log file, is at the link given: log file

And finally, I have not tried with only a few images in the csv file, so I don't know whether this task repetition problem occurs only when a large number of items is streamed or also with a small number of tasks.

A clarification: in the above Items or Task repetition problem - #6 by ngawangtrinley question I wrote that the 3000 items or tasks or examples were in the jsonl, but they are actually in the csv file. However, we also face the same repetition problem when we stream from a jsonl file.
I will be waiting for a solution. Thanks

Hi @ngawangtrinley,

Thanks for sharing all the details and the logs. It's good to know that sorting out hashes reduced the number of duplicates.
The next source of duplicates to consider is the work stealing mechanism. I know you've mentioned that you think this is not the case because you have set feed_overlap to false. Work stealing, however, is an independent setting: while feed_overlap determines how tasks should be distributed, work_stealing makes sure no data is lost in case some of the annotators remain idle.
Not sure if you had the chance to check out this section of our docs for more context/motivation behind this feature (I believe @koaning has shared it with you before, just referencing it back for convenience).

Looking at the logs you can definitely see that there's work stealing happening: https://github.com/ta4tsering/prodigy/blob/f15effdfd54b800a1d63a9a0a393b47cc3840c1c/logs_from_line_segmentation_modern.txt#L4930
In order to prevent it, please add allow_work_stealing: false to your config file.
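For reference, a minimal sketch of how the relevant part of the prodigy.json you shared could look with that key added (only these keys shown; everything else stays as it is):

{
    "feed_overlap": false,
    "instant_submit": true,
    "allow_work_stealing": false
}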

We have also identified that duplicates may be introduced when an annotator presses ctrl-s right after accepting an example with instant_submit set to true. It's on our radar, but for now the instruction would be to avoid doing that, given that instant_submit takes care of saving the data.
I do feel, though, that work stealing is responsible for the remaining duplicates in your dataset.
