Annotating video regions

Dear Prodigy Team,

I want to annotate regions in videos. Labels are tasks performed in the scene for several minutes. So, rather than going through each image, the UI should allow users to select a region and annotate it for the selected label (i.e., task). I'm wondering if there is a recipe that can be used to create such an interface for speeding up the video annotation task.

I will appreciate your support and guidance to achieve this task.

Many thanks and
Kind regards,

The main recipes for video at the moment focus on selecting the audio segments in it.

I wonder though. Wouldn't it be easier to have a preprocessing step that takes a video and segments it into frames that can be annotated? That way, you could use the image recipes again, and you might also be able to do something clever with the way you select the images. I imagine that two frames that follow each other in a video are likely similar, so it seems wasteful to annotate each and every one of them. You might be able to select only some key frames instead and use a trained model to fill in the blanks.

Might that work?

Dear @koaning

Thanks for the response. I am doing what you suggested for other object detection tasks where keyframes are extracted, labeled, and then used in the model training.

However, the work at hand involves tracking tasks in videos using image sequence models and computing task timings. Annotating the actual videos by region would be helpful, as the image sequence models would see the whole sequence of images for a task and achieve better performance. The videos I am working on are very long, between 2 and 6 hours each. Even 1 fps yields thousands of images per video. There is no way to extract and label keyframes for a reasonable number of videos using the typical image labeling recipe and UI, as this is laborious and therefore impractical. A region-based selection for videos, similar to the audio recipe in Prodigy, would be of great help if the labeled output simply captured the time range of each region and its annotated task. Or maybe there is a better way.

Any possibility?

Ah wait. My impression was that you were interested in annotating the frame in the video, but now my impression is that you're annotating the video over time. Is this correct?

If so, wouldn't the base interface work?

You'd make the selection in the audio element, but the selected timestamps will still appear when you call prodigy db-out which can be associated to your video.

If I'm misinterpreting the situation: could you give more details about your task? What are you specifically annotating and for what use-case? Maybe that can help me understand your problem better as well.

Dear @koaning, apologies for not describing the use case right the first time.

Yes, I want to annotate the videos over time but would like to generate task labels for each frame from it.

The annotation UI displays a video with frame numbers or time underneath like the audio-video recipe where users can select a region from the audio-like timeline component to annotate encompassing series of frames. The UI allows users to play part of the video where an annotated region is clicked as it currently does with the audio-video recipe. Internally, Prodigy can translate region-based selections into per-frame annotations or we can write a logic to extract it.

I hope this makes sense.
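For reference, the translation from region-based selections to per-frame annotations mentioned above could look roughly like the sketch below. It assumes spans shaped like Prodigy's audio spans (start and end in seconds plus a label) and a constant frame rate. The function name spans_to_frame_labels and the background argument are made up for this illustration and are not part of Prodigy.

```python
import math
from typing import Dict, List, Optional


def spans_to_frame_labels(
    spans: List[Dict],
    fps: float,
    n_frames: int,
    background: Optional[str] = None,
) -> List[Optional[str]]:
    """Return one label per frame; frames outside every span get `background`."""
    labels: List[Optional[str]] = [background] * n_frames
    for span in spans:
        # Frame i is shown at time i / fps, so the first frame inside the span
        # is ceil(start * fps) and the last one is floor(end * fps).
        first = max(0, math.ceil(span["start"] * fps))
        last = min(n_frames - 1, math.floor(span["end"] * fps))
        for i in range(first, last + 1):
            labels[i] = span["label"]
    return labels


# Toy example: a 3 second clip at 4 frames per second (assumed values).
spans = [
    {"start": 0.5, "end": 1.0, "label": "LABEL_A"},
    {"start": 2.0, "end": 2.5, "label": "LABEL_B"},
]
frame_labels = spans_to_frame_labels(spans, fps=4, n_frames=12)
```

If spans overlap, the later span in the list wins in this sketch, which may or may not be what a sequence model needs.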

I'm still not 100% sure if I understand, but after reading your response some ideas did pop into my mind.

  • Is it possible to consider a two step approach? One step would be to select the frames of interest, which could be an annotation task. Then you could take these annotations and turn them into data ready for another annotation task. Would such a flow work here?
  • Have you seen the blocks feature in Prodigy? It effectively allows you to stack annotation interfaces together. Would that help you here? I'll leave two Prodigy videos with examples below as inspiration.

If these comments didn't help, could you maybe share a picture of what you're interested in? Sometimes a picture says more than a thousand words and this might be one of those moments.

Dear @koaning

Please find the illustration of what I want to achieve.

I hope this might explain the requirement better.

Ah right, I think I understand now.

My impression is that the current UI can handle this task, but you're interested in a post-processing script that can take the selected timestamps and use those to extract the (frame, label) pairs from the video. Am I understanding it correctly this way?

If so, here's an approach that might work.

The Approach

First, I just use the normal annotation interface. My command is this:

prodigy audio.manual issue-6341 videos --loader video --label LABEL_A,LABEL_B

And this is what the interface looks like:

I've saved these annotations and exported them via db-out. This is what a single example looks like.

{
  "video": "videos/CleanShot 2023-02-10 at 14.26.05.mp4",
  "text": "CleanShot 2023-02-10 at 14.26.05",
  "meta": {
    "file": "CleanShot 2023-02-10 at 14.26.05.mp4"
  },
  "path": "videos/CleanShot 2023-02-10 at 14.26.05.mp4",
  "_input_hash": -313987848,
  "_task_hash": -469870525,
  "_is_binary": false,
  "_view_id": "audio_manual",
  "audio_spans": [
    {
      "start": 0.2789694763,
      "end": 1.604074489,
      "label": "LABEL_A",
      "id": "c60f9648-d09d-40a8-beef-bd6924c4d484",
      "color": "rgba(255,215,0,0.2)"
    },
    {
      "start": 2.1819398328,
      "end": 3.5070448454,
      "label": "LABEL_B",
      "id": "7efcbfa4-2b1f-465f-b6cf-fc6c97614e25",
      "color": "rgba(0,255,255,0.2)"
    },
    {
      "start": 3.8457934953,
      "end": 5.2306776814,
      "label": "LABEL_A",
      "id": "0cb9e2a7-6f3c-4f00-b921-86c71f6c7f07",
      "color": "rgba(255,215,0,0.2)"
    },
    {
      "start": 5.3502360284,
      "end": 6.0974756972,
      "label": "LABEL_B",
      "id": "2a091039-b560-4310-9008-a92ffb9f2c6d",
      "color": "rgba(0,255,255,0.2)"
    }
  ],
  "answer": "accept",
  "_timestamp": 1676036417
}

This example is expanded to make it easy to view, but I've saved the compact version, with one example per line, in a file called annotated.jsonl.

python -m prodigy db-out issue-6341 > annotated.jsonl

I will now use this file to annotate the frames from the video in a custom script.

import srsly
import numpy as np
import cv2

# First read all annotations into a list
annotations = list(srsly.read_jsonl("annotated.jsonl"))

# Next define a function that can extract frames from
# a single annotation.
def to_frames(annot):
    video_path = annot['path']
    spans = annot['audio_spans']
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    count = 0
    # Loop over all the frames and yield a frame if it falls within a span
    while True:
        success, frame = cap.read()
        if not success:
            # No more frames to read
            break
        timestamp = count / fps
        for span in spans:
            if span['start'] <= timestamp <= span['end']:
                yield {
                    "label": span["label"],
                    "frame": frame,
                    "timestamp": timestamp,
                    "path": video_path
                }
        count += 1
    cap.release()

# Define a generator that retrieves all annotated frames
g = (frame for annot in annotations for frame in to_frames(annot))

That final generator g contains the data that you seem to be interested in.

next(g).keys()
# dict_keys(['label', 'frame', 'timestamp', 'path'])

You can also inspect the numpy array that represents the image.

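import matplotlib.pylab as plt

# OpenCV returns frames in BGR order, so flip the channels for matplotlib
plt.imshow(next(g)['frame'][:, :, ::-1])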

This is what I see for my video.


Does this help? You would need to customise this script further, maybe also save the data in a way that works for you, but I think it does what you might want.
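One customisation that may matter for long recordings: the script above reads every frame, while the thread mentions 25 fps recordings where roughly 1 fps would be enough. Below is a minimal sketch of how the frame indices to keep could be computed up front. The name sampled_frame_indices and the fixed sampling grid are assumptions made for illustration, not something Prodigy or OpenCV provides.

```python
import math
from typing import Dict, Iterator, List, Tuple


def sampled_frame_indices(
    spans: List[Dict], native_fps: float, target_fps: float
) -> Iterator[Tuple[int, str]]:
    """Yield (frame_index, label) pairs, keeping roughly `target_fps` frames per second."""
    # Keep every `step`-th frame of the original video.
    step = max(1, round(native_fps / target_fps))
    for span in spans:
        first = math.ceil(span["start"] * native_fps)
        last = math.floor(span["end"] * native_fps)
        # Start at the first multiple of `step` inside the span so that the
        # sampled frames line up on the same grid across spans.
        start = math.ceil(first / step) * step
        for index in range(start, last + 1, step):
            yield index, span["label"]


# Toy example: a 3 second span in a 25 fps video, sampled at about 1 fps.
spans = [{"start": 10.0, "end": 13.0, "label": "Setup"}]
pairs = list(sampled_frame_indices(spans, native_fps=25, target_fps=1))
```

Each index could then be used to jump straight to a frame, for example with cap.set(cv2.CAP_PROP_POS_FRAMES, index) followed by cap.read(), although seeking accuracy depends on the video codec.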

@koaning: Exactly, but I hadn't thought of using audio.manual the way you described. Thanks.

Can we pass the file name for labels to the prodigy command? I have 27 labels for surgical tasks in a video for annotation. It will be much cleaner to load labels from files when there are many labels.

Another thing: I have a hierarchy in the labels, as shown below:

Task 1
--Sub Task 1.1
--Sub Task 1.2
--Sub Task 1.3
Task 2
--Sub Task 2.1
--Sub Task 2.2
...
Task 27

Would it be possible to modify the UI to support annotations for hierarchical labels? The UI should look like the one shown below:

Exactly, but I hadn't thought of using audio.manual the way you described. Thanks.

Happy to hear it :smile:

Would it be possible to modify the UI to support annotations for hierarchical labels?

You can always consider making your own custom HTML template for this, but this would be a fair amount of work. It does feel like you're asking for a component that's not natively supported via the blocks mechanic.

Can we pass the file name for labels to the prodigy command? I have 27 labels for surgical tasks in a video for annotation. It will be much cleaner to load labels from files when there are many labels.

What I'm about to propose here is a two-step approach. In the first step, you could select regions of interest via something like:

prodigy audio.manual issue-6341 videos --loader video --label REGION_OF_INTEREST

This will allow you to save many regions, like so:

Here's the thing that's nice. You can select regions that really need to have a label attached. And you can also choose to omit regions that do not require a label.

Given that we now have a dataset full of regions of interest, we can move on to an annotation interface where we can attach an appropriate label by using the text-input interface instead of the default choice one. This interface can autocomplete the input, which makes it easier to select from a large set of options.

Here's what this interface looks like:


The code

Here's the code you need for this custom recipe.

import prodigy
from typing import List

from prodigy.components.db import connect
from prodigy.components.preprocess import fetch_media as fetch_media_preprocessor
from prodigy.util import log, set_hashes, file_to_b64
from prodigy.types import TaskType, RecipeSettingsType


def remove_base64(examples: List[TaskType]) -> List[TaskType]:
    """Remove base64-encoded string if "path" is preserved in example."""
    for eg in examples:
        if "audio" in eg and eg["audio"].startswith("data:") and "path" in eg:
            eg["audio"] = eg["path"]
        if "video" in eg and eg["video"].startswith("data:") and "path" in eg:
            eg["video"] = eg["path"]
    return examples


@prodigy.recipe(
    "medical.custom",
    # fmt: off
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Dataset to annotate from", "positional", None, str),
    # fmt: on
)
def custom(dataset: str, source: str) -> RecipeSettingsType:

    db = connect()
    stream = db.get_dataset_examples(source)

    def split_stream_per_span(stream):
        # Turn each annotated video into one task per selected span
        for item in stream:
            for span in item["audio_spans"]:
                item_copy = {k: v for k, v in item.items()}
                item_copy["audio_spans"] = [span]
                del item_copy["answer"]
                del item_copy["_timestamp"]
                del item_copy["_is_binary"]
                item_copy["video"] = file_to_b64(item_copy["video"])
                yield set_hashes(item_copy, overwrite=True)

    stream = split_stream_per_span(stream)

    log("RECIPE: Starting recipe medical.custom", locals())

    blocks = [
        {"view_id": "audio_manual"},
        {"view_id": "text"},
        {
            "view_id": "text_input",
            "field_rows": 1,
            "field_label": "label",
            "field_id": "user_label",
            "field_autofocus": True,
            "field_suggestions": [
                "Stage Early - Situation Mild",
                "Stage Middle - Situation Mild",
                "Stage End - Situation Mild",
                "Stage Early - Situation Severe",
                "Stage Middle - Situation Severe",
                "Stage End - Situation Severe",
            ],
        },
    ]

    return {
        "view_id": "blocks",
        "dataset": dataset,
        "stream": stream,
        "before_db": remove_base64,
        "config": {
            "blocks": blocks,
            "labels": ["REGION_OF_INTEREST"],
            "audio_autoplay": False,
            "auto_count_stream": True,
        },
    }

Note how the text input field uses field_suggestions. You'd have to populate this list yourself.
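With 27 tasks and their sub tasks, typing that list by hand gets tedious. Here is a minimal sketch of how the suggestions could be generated from a hierarchy instead. The dictionary layout, the " - " separator and the helper name flatten_hierarchy are all assumptions for illustration; the task names are placeholders taken from the question above.

```python
from typing import Dict, List


def flatten_hierarchy(hierarchy: Dict[str, List[str]], sep: str = " - ") -> List[str]:
    """Turn {task: [sub tasks]} into flat 'task - sub task' strings."""
    suggestions = []
    for task, sub_tasks in hierarchy.items():
        if not sub_tasks:
            # A task without sub tasks is kept as a suggestion on its own.
            suggestions.append(task)
        for sub_task in sub_tasks:
            suggestions.append(f"{task}{sep}{sub_task}")
    return suggestions


# Placeholder hierarchy; the real one would come from your own label set.
hierarchy = {
    "Task 1": ["Sub Task 1.1", "Sub Task 1.2"],
    "Task 2": ["Sub Task 2.1"],
    "Task 27": [],
}
field_suggestions = flatten_hierarchy(hierarchy)
```

The resulting list could be passed as the "field_suggestions" value in the text_input block. The hierarchy itself could just as well be loaded from a JSON or text file.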

Remember how we selected regions of interest before via:

prodigy audio.manual issue-6341 videos --loader video --label REGION_OF_INTEREST

This recipe can take the issue-6341 dataset and iterate over each span so that you can attach the right label to it.

python -m prodigy medical.custom issue-6341-annot issue-6341 -F

When you annotate this, the annotations for each span will look like this:

{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":-1332879284,"_view_id":"blocks","audio_spans":[{"start":1.1108963076,"end":2.2666269953,"label":"REGION_OF_INTEREST","id":"a25e6bd1-1347-41b1-bc61-2d6995cc61c3","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage Middle - Situation Mild","answer":"accept","_timestamp":1676631261}
{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":386145696,"_view_id":"blocks","audio_spans":[{"start":3.2131305757,"end":4.617941153,"label":"REGION_OF_INTEREST","id":"a163628e-6782-4ff8-acf7-6ae2f425a88f","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage Early - Situation Severe","answer":"accept","_timestamp":1676631263}
{"video":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","text":"CleanShot 2023-02-10 at 14.26.05","meta":{"file":"CleanShot 2023-02-10 at 14.26.05.mp4"},"path":"videos/CleanShot 2023-02-10 at 14.26.05.mp4","_input_hash":-313987848,"_task_hash":17097882,"_view_id":"blocks","audio_spans":[{"start":5.2555856703,"end":6.0227517303,"label":"REGION_OF_INTEREST","id":"fb958448-25d6-4f09-96b2-9693d80a9a19","color":"rgba(255,215,0,0.2)"}],"user_label":"Stage End - Situation Severe","answer":"accept","_timestamp":1676631265}

Notice how each example has a "user_label"? That's a string that can also contain the hierarchical information. Note that each JSON line also contains the filename of the video as well as the filepath.

From here you could use the same trick as before to get frames for each of these spans.
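As a rough sketch of that last step: the exported lines can be reshaped so that every span carries the typed "user_label" instead of REGION_OF_INTEREST, which gives records with the same "path" and "audio_spans" keys that the earlier frame script reads. The helper name relabel_spans is hypothetical, and skipping examples that were not accepted is an assumption about what you'd want.

```python
import json
from typing import Dict, Iterable, Iterator


def relabel_spans(lines: Iterable[str]) -> Iterator[Dict]:
    """Replace each span's label with the example's "user_label"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        eg = json.loads(line)
        # Assumption: only accepted examples with a typed label are useful.
        if eg.get("answer") != "accept" or not eg.get("user_label"):
            continue
        spans = [{**span, "label": eg["user_label"]} for span in eg["audio_spans"]]
        yield {"path": eg["path"], "audio_spans": spans}


# A shortened example line in the same shape as the db-out output above.
sample = json.dumps({
    "path": "videos/example.mp4",
    "audio_spans": [{"start": 1.0, "end": 2.0, "label": "REGION_OF_INTEREST"}],
    "user_label": "Stage Early - Situation Mild",
    "answer": "accept",
})
relabelled = list(relabel_spans([sample]))
```

In practice the lines would come from the file written by db-out, for example by iterating over an open file handle.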

Quick Reflection

On reflection, I think the two step approach might be somewhat preferable to your original suggestion. By splitting the two tasks you end up with two relatively simple tasks that require little mouse-cursor movement. This may make it a lot quicker to annotate and it may also be less error prone.

I may be glancing over some important issues though, so feel free to correct me if I'm wrong.

Let me know!

Final Detail

While working on this I did realize that one feature is currently missing from the audio interface that would make this workflow smooth: the audio cursor always starts at the beginning. For this workflow it would be better if it started where the selected span starts. It might also be nice if the interface could show only a subset of the audio. I'll discuss with the team whether this might be a nice feature for the future.

Dear @koaning,

I've been successfully using the audio.manual command to run the Prodigy server for video timeline annotation, following your guidance. It works well for smaller videos.

However, I'm encountering an issue when working with larger videos, which can be up to 5GB in size. Prodigy is throwing the following error:

Could you kindly provide some insights on how to address this problem?

Thank you in advance for your assistance.

Warm regards,

Ah. I've never tried doing this with a 5 GB video before, but I can imagine that it exceeds the comfort zone of the encoding approach that we use. I can also imagine that the JavaScript library that we use under the hood might break with such files.

As a quick fix, is it possible to maybe split the video into smaller chunks? I imagine that the video is so large because it's also a long video. Or is it a very high-res video?

The videos we are using at the moment are quite large, reaching up to 5 hours of recording for a few operations in the dataset. The operating room cameras already split these recordings into 10-minute video clips (which are being used at the moment), each captured at a frame rate of 25 fps. While dividing these 10-minute clips further is possible, it would lead to a proliferation of smaller clips, which in turn complicates the annotation process for clinical users. Moreover, such division poses challenges for the ongoing action detection task, which requires nuanced hierarchical labelling. We need a solution that handles both the video size and the intricacies of annotation. Could you kindly provide information on the maximum video file size that the system can accommodate?

Just so I understand, the 10-minute videos are 5 GB?

If so, might it help to downsample the videos? As long as you can still make the annotation, the video doesn't have to be high-res. Might that help?

That is right. I used ffmpeg to change the video compression and adjust the bitrate, which reduced the size of the files significantly. Now it is working perfectly. Thanks.
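For anyone landing here later, a re-encode along these lines can be scripted. The exact settings used above were not shared, so the resolution, CRF value and audio bitrate below are illustrative assumptions only, and build_ffmpeg_command is a made-up helper name. The audio track is kept because the audio interface draws its timeline from it.

```python
import subprocess
from typing import List


def build_ffmpeg_command(src: str, dst: str, height: int = 480, crf: int = 28) -> List[str]:
    """Build an ffmpeg call that downsamples and recompresses a video."""
    return [
        "ffmpeg",
        "-i", src,
        # Scale to the given height; -2 keeps the aspect ratio with an even width.
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264",
        # Higher CRF means stronger compression and lower quality.
        "-crf", str(crf),
        "-preset", "fast",
        "-c:a", "aac",
        "-b:a", "64k",
        dst,
    ]


command = build_ffmpeg_command("001.mp4", "001_small.mp4")
# subprocess.run(command, check=True)  # uncomment to actually run ffmpeg
```

It is worth checking a converted clip by eye before batch-converting, since the right trade-off depends on how much detail annotators need to see.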


Hi @koaning,

I'd like to make some further adjustments to the UI layout for the video action labelling task:

  1. Move Annotation Buttons to the Top: I'm interested in moving the annotation buttons (accept, reject, etc.) to the top of the page. Additionally, if possible, I'd like to reduce their size slightly.
  2. Reposition Labels: I'd also like to relocate the labels currently displayed at the top of the video to the right side of the page.

Could you please provide some guidance or instructions on how to make these adjustments in my Prodigy project? Your help on this would be greatly appreciated.

Thank you in advance for your assistance.

Best regards,

@koaning: Another quick question; I'm using the following command to run Prodigy for video action detection labelling:

prodigy audio.manual timeline-1002 dataset/videos/surg-001/minis --loader video --label Scope_insert,Setup,Washout

The folder dataset/videos/surg-001/minis contains multiple video clips (e.g., 001.mp4, 002.mp4, and so on). Currently, Prodigy's audio.manual is selecting these clips randomly, which is causing inconvenience for the annotators. Is there a way to configure audio.manual to pick these video clips in ascending order, making the annotation process smoother? Your guidance on resolving this issue would be greatly appreciated.

Thanks and Best Regards

To answer the first question:

I'm not sure if we allow for these changes right now. Could you elaborate by explaining why you'd like these buttons moved? Do you have a screenshot of the layout that demonstrates that moving the buttons would make it better?

There are a lot of edge cases when it comes to assuming a sorted order that might break the annotator flow. Suppose that we sort by the name of a file. That might work for some people, but often the name of a file contains the name of a class. And if you have thousands of examples, then you might be forced to annotate 1000 examples before seeing an example from another class.

If you really want tight control over the order in which these examples appear, you could always write a custom recipe. But I am curious to learn some more about your use case, just so I understand your pain point a bit better. Is there a reason why ascending order would be better than random?
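If you go the custom recipe route, the ordering part itself is small. Below is a minimal sketch that lists the clips in a folder in natural ascending order, so 2.mp4 comes before 10.mp4. The helper names natural_key and sorted_videos are hypothetical, and turning the sorted paths into a Prodigy stream is left out here.

```python
import re
from pathlib import Path
from typing import List


def natural_key(path: Path) -> List:
    """Sort key that compares digit runs as numbers, so 2.mp4 comes before 10.mp4."""
    # re.split with a capture group alternates text and digit parts,
    # so every odd position holds a run of digits.
    parts = re.split(r"(\d+)", path.name)
    return [int(p) if i % 2 else p.lower() for i, p in enumerate(parts)]


def sorted_videos(folder: str, suffix: str = ".mp4") -> List[Path]:
    """Return the video files in `folder` in natural ascending order."""
    files = [p for p in Path(folder).iterdir() if p.suffix.lower() == suffix]
    return sorted(files, key=natural_key)


# Ordering example that does not need any files on disk.
names = [Path("10.mp4"), Path("2.mp4"), Path("001.mp4")]
ordered = [p.name for p in sorted(names, key=natural_key)]
```

A custom recipe could then yield one task per path in this order, for instance by building dictionaries with the same "video", "path", "text" and "meta" keys that appear in the exported examples earlier in this thread.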