Draw a shape and label a person and their behavior in a video frame

Hi! Are the labels in the second dimension (standing, touching etc.) pre-defined, or is this more of a free-form thing, depending on what's happening in the frame? If it's more like a free-form caption, you could use an interface with two blocks: image_manual to annotate the bounding boxes and text_input for the caption. See here for an example: https://prodi.gy/docs/custom-interfaces#blocks
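As a rough sketch of the free-form setup, the two blocks could be combined like this in a custom recipe. The helper name `make_blocks_components`, the `behavior` field ID and the label text are made up for illustration, and the dictionary it returns is what a recipe function decorated with `@prodigy.recipe` would return:

```python
def make_blocks_components(dataset, stream, labels):
    """Build the components dict for a two-block annotation interface.

    Hypothetical helper: in a real recipe this dict would be returned from
    the function decorated with @prodigy.recipe.
    """
    blocks = [
        # Block 1: draw bounding boxes on the video frame
        {"view_id": "image_manual"},
        # Block 2: free-form caption describing the behavior
        {
            "view_id": "text_input",
            "field_id": "behavior",  # assumed key the caption is stored under
            "field_label": "What is the person doing?",
            "field_rows": 2,
        },
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"labels": list(labels), "blocks": blocks},
    }


components = make_blocks_components("video_frames", [], ["PERSON"])
```

Each saved example should then contain both the drawn boxes and the caption text under the `behavior` key.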

If you have pre-defined labels for the second dimension, you could make two passes over the data: first, annotate the objects (person etc.), and review the annotations/data quality if needed. Next, stream in each bounding box together with the second layer of labels, e.g. as multiple-choice options (the choice interface). This also makes it easier to validate and evaluate the two layers separately, and to fix mistakes early (because if the bounding box is wrong, any subsequent annotation will most likely be wrong, too).
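Here's a minimal sketch of what the second pass could look like: it takes the examples from the first pass and produces one multiple-choice task per bounding box. The function name `boxes_to_choice_tasks` and the sample data are hypothetical, and it assumes the first-pass examples have the usual `image`, `spans` and `answer` keys. Whether the choice interface draws the box on top of the image is also an assumption you'd want to check; if it doesn't, you could crop the image to the box instead.

```python
import copy

# Pre-defined second-layer labels (examples taken from the question)
BEHAVIORS = ["standing", "touching"]


def boxes_to_choice_tasks(annotated_examples, behaviors):
    """Yield one choice task per bounding box from the first annotation pass."""
    options = [{"id": name, "text": name} for name in behaviors]
    for eg in annotated_examples:
        # Skip frames that were rejected or ignored in the first pass
        if eg.get("answer") != "accept":
            continue
        for span in eg.get("spans", []):
            yield {
                "image": eg["image"],
                # Keep only the box this question is about
                "spans": [copy.deepcopy(span)],
                "options": copy.deepcopy(options),
            }


# Made-up first-pass output, for illustration only
first_pass = [
    {
        "image": "frame_001.jpg",
        "answer": "accept",
        "spans": [
            {"label": "PERSON", "points": [[10, 10], [10, 80], [40, 80], [40, 10]]},
            {"label": "PERSON", "points": [[60, 15], [60, 90], [95, 90], [95, 15]]},
        ],
    },
    {"image": "frame_002.jpg", "answer": "reject", "spans": []},
]

tasks = list(boxes_to_choice_tasks(first_pass, BEHAVIORS))
```

You'd then feed `tasks` into a recipe using the `choice` interface, with `"choice_style": "multiple"` in the config if a person can show more than one behavior at once.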