Annotating video regions

Dear Prodigy Team,

I want to annotate regions in videos. The labels correspond to tasks performed in the scene, each spanning several minutes. So, rather than going through each image, the UI should allow users to select a region and annotate it with the selected label (i.e., task). I'm wondering if there is a recipe that can be used to create such an interface for speeding up the video annotation task.

I would appreciate your support and guidance to achieve this task.

Many thanks and
Kind regards,

The main recipes for video at the moment focus on selecting the audio segments in it.

I wonder though. Wouldn't it be easier to have a preprocessing step that segments a video into frames that can be annotated? That way, you could use the image recipes again, but you might also be able to do something clever with the way you select the images. I imagine that two frames that follow each other in a video are likely similar, so it seems wasteful to annotate each and every one of them. You might be able to select only some key frames instead and use a trained model to fill in the blanks.

Might that work?

Dear @koaning

Thanks for the response. I am doing what you suggested for other object detection tasks where keyframes are extracted, labeled, and then used in the model training.

However, the work at hand involves tracking tasks in videos using image sequence models and computing task timings. Annotating actual videos by region would be helpful, as the image sequence models would see the whole sequence of images for a task and achieve better performance. The videos I am working on are very long, between 2-6 hours in length. Even 1fps yields thousands of images per video. There is no way to extract and label keyframes for a reasonable number of videos using the typical image labeling recipe & UI, as this is laborious and thereby impractical. Region-based selection in videos, similar to the audio recipe in Prodigy, would be of great help if the labeled output simply captured the time range for each region and its annotated task, or maybe there is a better way.

Any possibility?

Ah wait. My impression was that you were interested in annotating the frame in the video, but now my impression is that you're annotating the video over time. Is this correct?

If so, wouldn't the base interface work?

You'd make the selection in the audio element, but the selected timestamps will still appear when you call prodigy db-out, and these can be associated with your video.
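For instance, a minimal sketch of reading those timestamps back out of the exported JSONL (assuming the export lives in a hypothetical annotated.jsonl file):

```python
import json


def read_spans(jsonl_path):
    """Collect (start, end, label) triples from the audio_spans
    of each task in a `prodigy db-out` JSONL export."""
    spans = []
    with open(jsonl_path, encoding="utf8") as f:
        for line in f:
            task = json.loads(line)
            for span in task.get("audio_spans", []):
                spans.append((span["start"], span["end"], span["label"]))
    return spans
```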

If I'm misinterpreting the situation: could you give more details about your task? What are you specifically annotating and for what use-case? Maybe that can help me understand your problem better as well.

Dear @koaning, apologies for not describing the use case right the first time.

Yes, I want to annotate the videos over time but would like to generate task labels for each frame from it.

The annotation UI displays a video with frame numbers or time underneath, like the audio-video recipe, where users can select a region from the audio-like timeline component to annotate the encompassing series of frames. The UI allows users to play the part of the video where an annotated region is clicked, as it currently does with the audio-video recipe. Internally, Prodigy can translate region-based selections into per-frame annotations, or we can write logic to extract them.

I hope this makes sense.

I'm still not 100% sure if I understand, but after reading your response some ideas came to mind to consider.

  • Is it possible to consider a two step approach? One step would be to select the frames of interest, which could be an annotation task. Then you could take these annotations and turn them into data ready for another annotation task. Would such a flow work here?
  • Have you seen the blocks feature in Prodigy? It effectively allows you to stack annotation interfaces together. Would that help you here? I'll leave two Prodigy videos with examples below as inspiration.

If these comments didn't help, could you maybe share a picture of what you're interested in? Sometimes a picture says more than a thousand words, and this might be one of those moments.

Dear @koaning

Please find the illustration of what I want to achieve.

I hope this might explain the requirement better.

Ah right, I think I understand now.

My impression is that the current UI can handle this task, but you're interested in a post-processing script that can take the selected timestamps and use those to extract the (frame, label) pairs from the video. Am I understanding it correctly this way?

If so, here's an approach that might work.

The Approach

First, I just use the normal annotation interface. My command is this:

prodigy audio.manual issue-6341 videos --loader video --label LABEL_A,LABEL_B

And this is what the interface looks like:

I've saved these annotations and exported them via db-out. This is what a single example looks like.

  "video": "videos/CleanShot 2023-02-10 at 14.26.05.mp4",
  "text": "CleanShot 2023-02-10 at 14.26.05",
  "meta": {
    "file": "CleanShot 2023-02-10 at 14.26.05.mp4"
  "path": "videos/CleanShot 2023-02-10 at 14.26.05.mp4",
  "_input_hash": -313987848,
  "_task_hash": -469870525,
  "_is_binary": false,
  "_view_id": "audio_manual",
  "audio_spans": [
      "start": 0.2789694763,
      "end": 1.604074489,
      "label": "LABEL_A",
      "id": "c60f9648-d09d-40a8-beef-bd6924c4d484",
      "color": "rgba(255,215,0,0.2)"
      "start": 2.1819398328,
      "end": 3.5070448454,
      "label": "LABEL_B",
      "id": "7efcbfa4-2b1f-465f-b6cf-fc6c97614e25",
      "color": "rgba(0,255,255,0.2)"
      "start": 3.8457934953,
      "end": 5.2306776814,
      "label": "LABEL_A",
      "id": "0cb9e2a7-6f3c-4f00-b921-86c71f6c7f07",
      "color": "rgba(255,215,0,0.2)"
      "start": 5.3502360284,
      "end": 6.0974756972,
      "label": "LABEL_B",
      "id": "2a091039-b560-4310-9008-a92ffb9f2c6d",
      "color": "rgba(0,255,255,0.2)"
  "answer": "accept",
  "_timestamp": 1676036417

This example is expanded to make it easier to view, but I've saved the compact version in a file called annotated.jsonl.

python -m prodigy db-out issue-6341 > annotated.jsonl

I will now use this file to annotate the frames from the video in a custom script.

import srsly
import cv2

# First read all annotations into a list
annotations = list(srsly.read_jsonl("annotated.jsonl"))

# Next define a function that can extract frames from
# a single annotation.
def to_frames(annot):
    video_path = annot["path"]
    spans = annot["audio_spans"]
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    count = 0
    success = True
    # Loop over all the frames and yield if a frame falls within a span
    while success:
        success, frame = cap.read()
        timestamp = count / fps
        for span in spans:
            if span["start"] <= timestamp <= span["end"]:
                yield {
                    "label": span["label"],
                    "frame": frame,
                    "timestamp": timestamp,
                    "path": video_path,
                }
        count += 1

# Define a generator that retrieves all annotated frames
g = (frame for annot in annotations for frame in to_frames(annot))

That final generator g yields the data you seem to be interested in.

example = next(g)
print(example.keys())
# dict_keys(['label', 'frame', 'timestamp', 'path'])

You can also inspect the numpy array that represents the image.

import matplotlib.pylab as plt

plt.imshow(next(g)["frame"])
plt.show()
This is what I see for my video.


Does this help? You would need to customise this script further, maybe also save the data in a way that works for you, but I think it does what you might want.

@koaning: Exactly, but I didn't think of using audio.manual the way you described. Thanks.

Can we pass the file name for labels to the prodigy command? I have 27 labels for surgical tasks in a video for annotation. It will be much cleaner to load labels from files when there are many labels.

Another thing: I have a hierarchy in the labels, as shown below:

Task 1
--Sub Task 1.1
--Sub Task 1.2
--Sub Task 1.3
Task 2
--Sub Task 2.1
--Sub Task 2.2
...
Task 27

Would it be possible to modify the UI to support annotations for hierarchical labels? The UI should look as shown below:

Exactly, but I didn't think of using audio.manual the way you described. Thanks.

Happy to hear it :smile:

Would it be possible to modify the UI to support annotations for hierarchical labels?

You can always consider making your own custom HTML template for this, but this would be a fair amount of work. It does feel like you're asking for a component that's not natively supported via the blocks mechanic.

Can we pass the file name for labels to the prodigy command? I have 27 labels for surgical tasks in a video for annotation. It will be much cleaner to load labels from files when there are many labels.

What I'm about to propose here is kind of a two-step approach. In the first step, you could select regions of interest. Via something like:

prodigy audio.manual issue-6341 videos --loader video --label REGION_OF_INTEREST

This will allow you to save many regions, like so:

Here's the thing that's nice. You can select regions that really need to have a label attached. And you can also choose to omit regions that do not require a label.

Given that we now have a dataset full of regions of interest, we can move on to an annotation interface where we can attach an appropriate label by using the text-input interface instead of the default choice one. This interface can auto-complete the input, which allows you to more easily select from a large set of options.

Here's what this interface looks like:

CleanShot 2023-02-17 at 11.49.30

The code

Here's the code you need for this custom recipe.

import prodigy
from typing import List

from prodigy.components.db import connect
from prodigy.components.preprocess import fetch_media as fetch_media_preprocessor
from prodigy.util import log, set_hashes, file_to_b64
from prodigy.types import TaskType, RecipeSettingsType


def remove_base64(examples: List[TaskType]) -> List[TaskType]:
    """Remove base64-encoded string if "path" is preserved in example."""
    for eg in examples:
        if "audio" in eg and eg["audio"].startswith("data:") and "path" in eg:
            eg["audio"] = eg["path"]
        if "video" in eg and eg["video"].startswith("data:") and "path" in eg:
            eg["video"] = eg["path"]
    return examples


@prodigy.recipe(
    "medical.custom",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Dataset to annotate from", "positional", None, str),
    # fmt: on
)
def custom(dataset: str, source: str) -> RecipeSettingsType:

    db = connect()
    stream = db.get_dataset_examples(source)

    def split_stream_per_span(stream):
        # Turn each annotated example into one task per selected span
        for item in stream:
            for span in item["audio_spans"]:
                item_copy = {k: v for k, v in item.items()}
                item_copy["audio_spans"] = [span]
                del item_copy["answer"]
                del item_copy["_timestamp"]
                del item_copy["_is_binary"]
                item_copy["video"] = file_to_b64(item_copy["video"])
                yield set_hashes(item_copy, overwrite=True)

    stream = split_stream_per_span(stream)

    log("RECIPE: Starting recipe medical.custom", locals())

    blocks = [
        {"view_id": "audio_manual"},
        {"view_id": "text"},
        {
            "view_id": "text_input",
            "field_rows": 1,
            "field_label": "label",
            "field_id": "user_label",
            "field_autofocus": True,
            "field_suggestions": [
                "Stage Early - Situation Mild",
                "Stage Middle - Situation Mild",
                "Stage End - Situation Mild",
                "Stage Early - Situation Severe",
                "Stage Middle - Situation Severe",
                "Stage End - Situation Severe",
            ],
        },
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "before_db": remove_base64,
        "config": {
            "blocks": blocks,
            "labels": ["REGION_OF_INTEREST"],
            "audio_autoplay": False,
            "auto_count_stream": True,
        },
    }

Note how the text input field uses field_suggestions. You'd have to populate this list yourself.
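Since you mentioned wanting to load your 27 labels from a file, one way to populate that list is from a plain text file with one label per line (labels.txt here is a hypothetical path, not a Prodigy convention):

```python
from pathlib import Path


def load_label_suggestions(path):
    """Read one label per line from a text file, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# In the recipe you could then use:
# "field_suggestions": load_label_suggestions("labels.txt"),
```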

Remember how before we selected regions of interest via:

prodigy audio.manual issue-6341 videos --loader video --label REGION_OF_INTEREST

This recipe can take the issue-6341 dataset and iterate over each span so that you can attach the right label to it.

python -m prodigy medical.custom issue-6341-annot issue-6341 -F

When you annotate this, the annotations for each span will look like this:

{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":-1332879284,"_view_id":"blocks","audio_spans":[{"start":1.1108963076,"end":2.2666269953,"label":"REGION_OF_INTEREST","id":"a25e6bd1-1347-41b1-bc61-2d6995cc61c3","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage Middle - Situation Mild","answer":"accept","_timestamp":1676631261}
{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":386145696,"_view_id":"blocks","audio_spans":[{"start":3.2131305757,"end":4.617941153,"label":"REGION_OF_INTEREST","id":"a163628e-6782-4ff8-acf7-6ae2f425a88f","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage Early - Situation Severe","answer":"accept","_timestamp":1676631263}
{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":17097882,"_view_id":"blocks","audio_spans":[{"start":5.2555856703,"end":6.0227517303,"label":"REGION_OF_INTEREST","id":"fb958448-25d6-4f09-96b2-9693d80a9a19","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage End - Situation Severe","answer":"accept","_timestamp":1676631265}

Notice how each example has a "user_label"? That's a string that can also contain the hierarchical information. Note that each JSON line also contains both the filename and the filepath of the video.

From here you could use the same trick as before to get frames for each of these spans.

Quick Reflection

On reflection, I think the two step approach might be somewhat preferable to your original suggestion. By splitting the two tasks you end up with two relatively simple tasks that require little mouse-cursor movement. This may make it a lot quicker to annotate and it may also be less error prone.

I may be glossing over some important issues though, so feel free to correct me if I'm wrong.

Let me know!

Final Detail

While working on this I did realize that there is currently one feature missing from the audio interface to make this workflow smooth: the audio cursor always starts at the beginning. For this workflow, it'd be better if playback started where the selected span starts. It might also be nice if the interface could subset the audio. I'll discuss with the team whether this might be a nice feature for the future.