Creating a custom label-review recipe to remove noise from a dataset

Hi,
I want to review the labels of a dataset and remove noisy ones using Prodigy. The dataset comprises robotic surgery videos where each clip has one label describing the presence of tools in four robotic arms. See an example label for a video clip below:

I extracted 120 frames from each video and used the video-level labels to generate labels for the extracted frames. Now here is the problem: a frame will not necessarily show the same set of tools as the whole video, because surgeons temporarily move tools out of view to clear their field of vision. The labels need correcting for such frames. The following image is an example where two tools are present but the label says three.

I want to create a Prodigy interface with an image item and 4 choice items (one for each robotic arm) to fix these labels. I have the following UI in mind:

I don't know whether Prodigy allows such customisation. I can generate a JSON file for each clip describing the annotated labels, which could be used to pre-select the tools for each image when it is rendered in Prodigy.

I have tried to write the following recipe:

import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, images_path):
    stream = Images(images_path)
    
    tools = [
        "needle_driver",
        "monopolar_curved_scissor",
        "force_bipolar",
        "clip_applier",
        "tip_up_fenestrated_grasper",
        "cadiere_forceps",
        "bipolar_forceps",
        "vessel_sealer",
        "suction_irrigator",
        "bipolar_dissector",
        "prograsp_forceps",
        "stapler",
        "permanent_cautery_hook_spatula",
        "grasping_retractor",
        "nan",
        "blank"
    ]

    options = [{"id": t, "text": t} for t in tools]
    
    blocks = [
        {"view_id": "image"},
        {"view_id": "choices", "options": options},
        {"view_id": "choices", "options": options},
        {"view_id": "choices", "options": options},
        {"view_id": "choices", "options": options},
    ]
    
    return {
        "view_id": "blocks",
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": stream,
    }

But it is far from what I am trying to achieve. I don't know how to create such a UI, attach options to the four choice lists for each image, and let the annotator edit them when they are incorrect.

Can someone please guide me through this functionality?

I will greatly appreciate your help on this.

Many thanks and

Kind Regards,
Bilal

I have tried to customise the streaming logic and associate options with each image item as follows:

import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, images_path):
    
    tools = [
        "needle_driver",
        "monopolar_curved_scissor",
        "force_bipolar",
        "clip_applier",
        "tip_up_fenestrated_grasper",
        "cadiere_forceps",
        "bipolar_forceps",
        "vessel_sealer",
        "suction_irrigator",
        "bipolar_dissector",
        "prograsp_forceps",
        "stapler",
        "permanent_cautery_hook_spatula",
        "grasping_retractor",
        "nan",
        "blank"
    ]

    options = [{"id": t, "text": t} for t in tools]
    
    blocks = [
        {"view_id":"image"},
        {"view_id":"choices"}
    ]
    
    def get_stream():
        stream = Images(images_path)
        for item in stream:
            item["options"] = options
            yield item
    
    
    return {
        "view_id": "blocks",
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": get_stream(),
    }

Now I get the following error:

Is this the right direction?

I somehow figured out how to display choices for labelling one arm using the code below:

import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, images_path):
    stream = Images(images_path)
   
    blocks = [
        {"view_id":"image"},
        {"view_id":"choice"},
    ]
    
    def add_options(stream):
        
        tools = [
            "needle_driver",
            "monopolar_curved_scissor",
            "force_bipolar",
            "clip_applier",
            "tip_up_fenestrated_grasper",
            "cadiere_forceps",
            "bipolar_forceps",
            "vessel_sealer",
            "suction_irrigator",
            "bipolar_dissector",
            "prograsp_forceps",
            "stapler",
            "permanent_cautery_hook_spatula",
            "grasping_retractor",
            "nan",
            "blank"
        ]
        
        options = [{"id": t, "text": t} for t in tools]

        for item in stream:
            item["options"] = options
            yield item
    
    stream = add_options(stream)

    return {
        "view_id": "blocks",
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": stream,
    }

The output is as follows:

But now it shows each image twice, and I don't know why. Besides, I have no idea how to add three more choice lists next to each other.

hi @nlp-guy!

Very interesting project! Thanks for sharing and your questions.

Four different input panels (arms) may be challenging. The simplest would be to label one arm at a time. However, I bet you've already rejected that idea to avoid doing 4x annotations.

Another option may be to create four vertically stacked input boxes with field suggestions. You'd use the open-ended text input box but add field_suggestions, which enables auto-suggest and auto-complete. You can then tab between the boxes and fill in your categories by auto-completing.

Here's the code of an example:

import prodigy
from prodigy.components.loaders import Images

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, images_path):
    
    stream = Images(images_path)
   
    tools = [
        "needle_driver",
        "monopolar_curved_scissor",
        "force_bipolar",
        "clip_applier",
        "tip_up_fenestrated_grasper",
        "cadiere_forceps",
        "bipolar_forceps",
        "vessel_sealer",
        "suction_irrigator",
        "bipolar_dissector",
        "prograsp_forceps",
        "stapler",
        "permanent_cautery_hook_spatula",
        "grasping_retractor",
        "nan",
        "blank"
    ]

    blocks = [
        {"view_id":"image"},
        {"view_id": "text_input", "field_id": "arm_a", "field_placeholder": "Arm A", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_b", "field_placeholder": "Arm B", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_c", "field_placeholder": "Arm C", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_d", "field_placeholder": "Arm D", "field_suggestions": tools},
    ]

    return {
        "view_id": "blocks",
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": stream,
    }

A few downsides to this (maybe there's a solution):

  • You can only select one value per field at a time.
  • You can still enter free text other than these categories, which is bad if you accidentally mistype something. Ideally you would validate these fields afterwards (e.g., using a validate_answer callback) to ensure they only contain these categories. I found that using auto-complete (pressing the down arrow) will select the closest match.
  • This could likely be improved with defaults/placeholders.
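A minimal sketch of such a validate_answer callback, assuming the field IDs from the blocks above (arm_a to arm_d) and a trimmed tool list for brevity:

```python
# Sketch: block answers whose text-input fields contain anything outside the
# known tool list. TOOLS is trimmed here; you'd use the full list from the recipe.
TOOLS = {"needle_driver", "monopolar_curved_scissor", "cadiere_forceps", "nan", "blank"}

def validate_answer(eg):
    for field in ("arm_a", "arm_b", "arm_c", "arm_d"):
        value = eg.get(field, "")
        if value and value not in TOOLS:
            # Raising an error shows the message in the UI and prevents
            # the answer from being submitted.
            raise ValueError(f"{field}: '{value}' is not a known tool")
```

You'd hook it up by adding `"validate_answer": validate_answer` to the dictionary the recipe returns.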

If this doesn't work, then your next solution would likely be custom JavaScript. The post below has one idea: adding a "check box" to show only one category (arm) at a time. You could create four check boxes to toggle the input per arm (e.g., ARM A, ARM B, ARM C, ARM D).

Not sure. Does it always show a duplicate image or a different image?

Could you have modified something in your prodigy.json? Perhaps vim /path/to/prodigy.json and double check you don't have any overrides?

Alternatively, this should work too if you want to reset your config overrides:

export PRODIGY_CONFIG_OVERRIDES="{}"

Let me know if this persists and we can follow up, but I suspect it's something in your code somewhere.

Hi @ryanwesslen,

This UI will work as well. I don't see repeated images anymore with your code. Thanks. I am changing the width and height of the image card in prodigy.json:

{
  "custom_theme": {"cardMaxWidth": 640, "cardMaxHeight": 512}
}

The only part that remains unimplemented is assigning default values (the old labels) to each of these fields. I can create a JSON file with image paths and labels from the data frame.

Any ideas on how to do that?

hi @nlp-guy,

Yep. To pre-fill, you need the key of each label to align with the name of the field_id, for example:

{"image": "data/image-arms/images.png", "arm_a": "needle_driver", "arm_b": "nan", "arm_c": "needle_driver", "arm_d": "cadiere_forceps"}

See this post for more details

Since you'll be loading a file (.jsonl), you'll need to use the JSONL loader along with the fetch_media preprocessor to fetch the images.

I wrote a modified version of the script above that assumes a .jsonl like I showed above:

import prodigy
from prodigy.components.preprocess import fetch_media
from prodigy.components.loaders import JSONL

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, image_file):
    
    stream = JSONL(image_file)
    stream = fetch_media(stream, ["image"], skip=True)
   
    tools = [
        "needle_driver",
        "monopolar_curved_scissor",
        "force_bipolar",
        "clip_applier",
        "tip_up_fenestrated_grasper",
        "cadiere_forceps",
        "bipolar_forceps",
        "vessel_sealer",
        "suction_irrigator",
        "bipolar_dissector",
        "prograsp_forceps",
        "stapler",
        "permanent_cautery_hook_spatula",
        "grasping_retractor",
        "nan",
        "blank"
    ]

    blocks = [
        {"view_id":"image"},
        {"view_id": "text_input", "field_id": "arm_a", "field_placeholder": "Arm A", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_b", "field_placeholder": "Arm B", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_c", "field_placeholder": "Arm C", "field_suggestions": tools},
        {"view_id": "text_input", "field_id": "arm_d", "field_placeholder": "Arm D", "field_suggestions": tools},
    ]

    return {
        "view_id": "blocks",
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": stream,
    }

It seemed to work for me. Hope this helps!

Thank you @ryanwesslen. This is very helpful.

A quick one: I created a JSON file with the following contents for all the images (~700k) in the dataset:

[{'image': 'data/train_images_crop_sml/clip_000000/00510.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'},
 {'image': 'data/train_images_crop_sml/clip_000000/00195.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'},
 {'image': 'data/train_images_crop_sml/clip_000000/01605.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'},
 {'image': 'data/train_images_crop_sml/clip_000000/01755.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'},
 {'image': 'data/train_images_crop_sml/clip_000000/01290.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'},
 {'image': 'data/train_images_crop_sml/clip_000000/00420.jpg', 'arm_a': 'needle_driver', 'arm_b': 'nan', 'arm_c': 'needle_driver', 'arm_d': 'nan'}]

Then, I ran the following command to start Prodigy:

prodigy data-review-recipe datareview2 ./prodigy_input.jsonl -F recipe.py

I received the following error:

Any ideas on what is wrong with my command?

Thanks in advance.

Best regards,
Bilal

Never mind, I figured it out. The input file needed to be in JSONL format (one JSON object per line), not a JSON array. Resolved.
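For reference, a one-off conversion from a JSON array to JSONL can be sketched like this (file names assumed):

```python
import json

def json_array_to_jsonl(src, dst):
    """Convert a JSON array of records into JSONL (one JSON object
    per line), which is what Prodigy's JSONL loader expects."""
    with open(src, encoding="utf8") as f:
        records = json.load(f)
    with open(dst, "w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```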

I don't want to store images in the database. The --remove-base64 flag is not working when used in the following command:

Besides, is there a way to skip exporting images in the db-out command in case they have been mistakenly loaded into the Prodigy database?

hi @nlp-guy!

You can add this to your custom recipe by modifying it as follows:

from typing import List

import prodigy
from prodigy.components.preprocess import fetch_media
from prodigy.components.loaders import JSONL

@prodigy.recipe("data-review-recipe")
def data_review_recipe(dataset, image_file, remove_base64: bool = False):
    stream = JSONL(image_file)
    stream = fetch_media(stream, ["image"], skip=True)

    ...

    def before_db(examples: List[dict]) -> List[dict]:
        # Remove all data URIs before storing examples in the database
        for eg in examples:
            if eg["image"].startswith("data:"):
                eg["image"] = eg.get("path")
        return examples

    return {
        "view_id": "blocks",
        "before_db": before_db if remove_base64 else None,
        "config": {"blocks": blocks},
        "dataset": dataset,
        "stream": stream,
    }

What do you mean by "mistakenly loaded"? How would you identify which annotations in your dataset have been mistakenly loaded?

Here's roughly what the db-out recipe looks like; it is fairly lightweight. Can you modify it to drop the records that you think should be excluded?

from typing import Optional, Union
from pathlib import Path
import srsly

from prodigy import recipe
from prodigy.components.db import connect

@recipe(
    "db-out",
    set_id=("Name of dataset to export", "positional", None, str),
    out_dir=("Path to output directory", "positional", None, str),
    answer=("Only export annotations with this answer", "option", "a", str),
)
def db_out(
    set_id: str,
    out_dir: Optional[Union[str, Path]] = None,
    answer: Optional[str] = None,
):
    DB = connect()
    examples = DB.get_dataset_examples(set_id)
    if answer:
        examples = [eg for eg in examples if eg.get("answer") == answer]
    if out_dir is None:
        for eg in examples:
            print(srsly.json_dumps(eg))
    else:
        out_dir = Path(out_dir)
        if not out_dir.exists():
            out_dir.mkdir()
        out_file = out_dir / f"{set_id}.jsonl"
        srsly.write_jsonl(out_file, examples)
I didn't try these out on the full recipe, so there could be a typo, but hopefully they give you direction.