Hi! The audio and video UIs support annotating segments in the audio track, but there's currently no interface for object tracking annotation in video files.
However, if your goal is to just mark the speaker (and not track the speaker's movement across frames etc.), you could probably use a combination of the audio or video interface and the image_manual interface with a still image from the video. See here for the docs on custom interfaces with blocks. The solution here really depends on what your end goal is, what type of structured information you want to extract and what you're planning on using that structured data for later on.
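
To make the blocks idea a bit more concrete, here's a rough sketch of what the combined setup could look like. It's plain Python and only builds the dictionary of components that a custom recipe function would typically return, so it runs without Prodigy installed. The helper names (make_task, build_components), the dataset name, the SPEAKER label and the file locations are all made up for illustration, and I haven't verified that audio and image_manual work together in one blocks layout for your version, so treat it as a starting point to check against the docs rather than a confirmed recipe.

```python
def make_task(audio, still_image):
    """Build one annotation task (hypothetical helper).

    Assumption: the audio block reads the task's "audio" key and the
    image_manual block reads its "image" key. In practice both values
    would usually be URLs or base64 data URIs, not bare file names.
    """
    return {"audio": audio, "image": still_image}


def build_components(dataset, tasks, labels):
    """Return the components a custom recipe would hand back (sketch only)."""
    return {
        "dataset": dataset,        # dataset the annotations are saved to
        "stream": iter(tasks),     # iterable of task dictionaries
        "view_id": "blocks",       # combine several interfaces in one view
        "config": {
            "labels": list(labels),  # label set offered in the manual interfaces
            "blocks": [
                {"view_id": "audio"},         # mark segments in the audio track
                {"view_id": "image_manual"},  # draw a box around the speaker in the still
            ],
        },
    }


# Placeholder inputs: one clip plus one frame exported from the same video.
tasks = [make_task("clip_001.mp3", "clip_001_frame.jpg")]
components = build_components("speaker_demo", tasks, ["SPEAKER"])
```

In a real recipe, you'd return a dictionary like this from a function registered as a custom recipe. Whether a single still per clip is enough again depends on what you want to do with the data later on.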