filter_inputs still causes duplicated images

def VOCImg(dataset, images_path):
    labels = ["Player"]
    # stream = Images(images_path)
    stream = filter_inputs(Images(images_path), connect().get_input_hashes(dataset))

    def Submission(MainData):
        PascalVOC(dataset, MainData, labels, images_path)
        print(f"{len(MainData)} annotations have been submitted.")

    def Progress(ctrl, update_return_value):
        return ctrl.total_annotated / len([f for f in os.listdir(images_path) if os.path.isfile(os.path.join(images_path, f))])

    def ProcessData(SubmittedData):
        for data in SubmittedData:
            if data["image"].startswith("data:") and "path" in data:
                data["image"] = data["path"]
        return SubmittedData

    return {
        "dataset": dataset,
        "view_id": "image_manual",
        "stream": stream,
        "update": Submission,
        "progress": Progress,
        "db": True,
        "config": {
            "labels": labels,
            "buttons": ["accept", "reject", "undo"]
        },
        "before_db": ProcessData
    }

So currently, I'm trying to build a system that saves annotations to Pascal VOC format when a user submits the labeling. I tend to restart the app when adding labels. Recently, I discovered that images keep getting duplicated (i.e. they show up again for annotation even though they've already been annotated) after restarting the app, even after using filter_inputs (which was suggested on a different topic). What would be the best way to prevent duplicate images after restarting the app?

Hi! By duplicate, do you mean images that are already annotated show up again when you restart the server (e.g. to add a label)?

If so, have you inspected the hashes, and do they come back different?
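One quick way to check is to look at what's already stored for the dataset, for example (the dataset name below is just a placeholder):

from prodigy.components.db import connect

db = connect()
# input hashes of all examples already saved in the dataset
print(db.get_input_hashes("your_dataset_name"))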

Btw, you should also be able to add "exclude_by": "input" to the "config" returned by your recipe. This should have the same effect as the input filtering you're doing and it will tell Prodigy to consider two incoming examples with the same input hashes duplicates (even if other annotations, like suggested bounding boxes etc., are different).
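In the recipe above, that would just be one extra setting in the returned "config" (abridged sketch, not the full recipe):

return {
    "dataset": dataset,
    "view_id": "image_manual",
    "stream": stream,
    # ...
    "config": {
        "labels": labels,
        "buttons": ["accept", "reject", "undo"],
        "exclude_by": "input",  # treat tasks with the same _input_hash as duplicates
    },
}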

Ah, I haven't looked at the hashes yet since I just assumed they would match, but I will take a look at it. And yes, the images that were already annotated show up again after restarting the app.

Alright, I just labeled the duplicated data and checked the table, and it does indeed have duplicate _input_hash values. I will go ahead and try adding "exclude_by": "input" under the config and see if that helps.

Edit: I just tried it; unfortunately, the duplication still showed up.

@zhiyan114 Thanks for checking, it's definitely good to know that it's actually duplicate input hashes and not different hashes generated for the same image etc.

How quickly are you annotating? I wonder if you might be hitting the race condition described here:

I would say it takes at least 5 seconds per image annotation. The duplication only occurs when I restart the app; otherwise, it operates normally.

Okay, so this sounds more like the hashes that are generated when the data is loaded back don't match the hashes present in the dataset.

I just noticed that the current solution you have there with filter_inputs won't work right after loading the images back, because they won't yet have hashes assigned. So you can call prodigy.set_hashes on the incoming examples yourself, and then check whether the hash is already in your dataset. There's actually very little magic going on here; the logic is mostly this:

input_hashes = connect().get_input_hashes(dataset)
stream = Images(source)
for eg in stream:
    eg = set_hashes(eg)  # assigns _input_hash and _task_hash
    if eg["_input_hash"] not in input_hashes:
        yield eg

If you do find an example that was previously annotated but received a different hash, then we can investigate what caused the difference.
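For example, wrapped into a small generator that the recipe above could use (the helper name get_filtered_stream is just for illustration), it might look like this:

from prodigy import set_hashes
from prodigy.components.db import connect
from prodigy.components.loaders import Images

def get_filtered_stream(dataset, images_path):
    # hashes of everything already annotated and saved in the dataset
    input_hashes = connect().get_input_hashes(dataset)
    for eg in Images(images_path):
        eg = set_hashes(eg)  # assign _input_hash / _task_hash before comparing
        if eg["_input_hash"] not in input_hashes:
            yield eg

In the recipe, "stream": get_filtered_stream(dataset, images_path) would then replace the filter_inputs call.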

Thanks, this solution seems to solve the issue I was having. I'll go ahead and reset the dataset and annotations, and will let you know if the problem persists.

EDIT: So far so good. Everything is currently running as expected. Thanks again for the help.

Though I have one more question if you don't mind answering: Is it possible to define certain bounding box colors based on what the annotator chooses? (In VOC, there are options to mark whether the boxed area is only partially visible, difficult to detect, or both, so I want to let the annotator define that using the box color instead of using multiple different labels.) Using labels to define that would increase the amount of back-end code, so I just want to confirm whether this is possible.


Hi @ines,
I'm having a problem with the filter_inputs function as well, so I thought I would post it here.

I'm trying to filter my stream by the _input_hash using the filter_inputs function, but I can't get it to work. I've checked the database, and the hashes are the same.

My setup is as follows:

  • I've written a custom loader that dumps and prints the JSON. I'm manually adding an _input_hash to each task using Python's built-in hash method (see the sketch after this list).
  • I've written a custom recipe using the blocks interface.
  • To read from sys.stdin, I'm using the `get_stream` method.
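A simplified sketch of the loader idea (field names and file handling here are just placeholders):

import json
import sys

def load_tasks(path):
    with open(path, encoding="utf8") as f:
        for line in f:
            record = json.loads(line)
            task = {"text": record["text"], "meta": record.get("meta", {})}
            task["_input_hash"] = hash(task["text"])  # Python's built-in hash()
            yield task

if __name__ == "__main__":
    for task in load_tasks(sys.argv[1]):
        print(json.dumps(task))

And the recipe itself: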
@prodigy.recipe(
    'task1', 
    dataset=('The dataset to store data in', 'positional',None, str),
    source=("Data to annotate (file path or '-' to read from standard input)", "positional", None, str),
)
def task1(dataset, source):
    db = connect()
    input_hashes = db.get_input_hashes(dataset)
    stream = prodigy.get_stream(source)
    stream = add_options(stream)
    stream = filter_inputs(stream, input_hashes)
    return {
        'stream': stream,
        'dataset': dataset,
        'view_id':'blocks',
        'config':{
            'blocks':[
                {'view_id':'html','html_template':html2},
                {'view_id':'choice'},
                {'view_id':'html','html_template':html2},
            ],
            "history_size":10,
            "choice_style":"single",
            'javascript':functions,
        }
    }

And here is the output from db-out. It consists of three examples, and as you can see by comparing lines (1, 4), (2, 5) and (3, 6), the input hashes are the same.

{"meta":{"id":"1003517020111303_1003525683443770","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Hej ","_input_hash":943287096108137100,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":-1290115455,"_session_id":null,"_view_id":"blocks","config":{"choice_style":"single"},"accept":[],"answer":"accept"}
{"meta":{"id":"1003517020111303_1003525763443762","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Hej","_input_hash":6904969615189831000,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":-1114970609,"_session_id":null,"_view_id":"blocks","accept":[],"config":{"choice_style":"single"},"answer":"accept"}
{"meta":{"id":"1003517020111303_1003525810110424","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Lyssnar","_input_hash":2672328271787085300,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":378550198,"_session_id":null,"_view_id":"blocks","accept":[99],"config":{"choice_style":"single"},"answer":"accept"}
{"meta":{"id":"1003517020111303_1003525683443770","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Hej ","_input_hash":943287096108137100,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":-1290115455,"_session_id":null,"_view_id":"blocks","config":{"choice_style":"single"},"accept":[],"answer":"accept"}
{"meta":{"id":"1003517020111303_1003525763443762","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Hej","_input_hash":6904969615189831000,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":-1114970609,"_session_id":null,"_view_id":"blocks","accept":[],"config":{"choice_style":"single"},"answer":"accept"}
{"meta":{"id":"1003517020111303_1003525810110424","source":"se_facebook","reaction_count":0,"angry_count":0},"text":"Lyssnar","_input_hash":2672328271787085300,"options":[{"id":0,"text":"Offensive"},{"id":1,"text":"Hateful"},{"id":2,"text":"Violent"},{"id":99,"text":"Hard to say"}],"_task_hash":378550198,"_session_id":null,"_view_id":"blocks","accept":[99],"config":{"choice_style":"single"},"answer":"accept"}

I've tried out different methods to filter, including setting the hashes in the recipe using set_hashes, and using the dedup param, which fails because the task has no _task_hash at stream time.

Please let me know, if you need more information to help me debug this.

Thanks!

It looks like the problem here is that your incoming examples that are loaded from disk don't have input hashes, so nothing is filtered. Ultimately, you probably just want to do something like this:

def filter_inputs(stream, input_hashes):
    for eg in stream:
        eg = set_hashes(eg)  # make sure the example has _input_hash / _task_hash
        if eg["_input_hash"] not in input_hashes:
            yield eg

This should also make it a bit easier to debug things, because you can just print the incoming examples and check their hashes.
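For example, a small wrapper like this (just a debugging sketch, not part of the API) will print the hashes as examples stream through:

def debug_hashes(stream):
    for eg in stream:
        # show the hashes plus a snippet of the text for each incoming example
        print(eg.get("_input_hash"), eg.get("_task_hash"), str(eg.get("text", ""))[:40])
        yield eg

# in the recipe: stream = debug_hashes(filter_inputs(stream, input_hashes))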