I'm hoping to use Prodigy to annotate a few different datasets. The image annotation capabilities provided out of the box look great. I was also hoping to annotate some video datasets with Prodigy, but after looking through the audio & video docs, it seems like the main video annotation support today is for labeling audio in relation to a video. Searching the forums, I found that object tracking is not currently implemented, but that doesn't mean it's not possible.
From there I was wondering: would it be possible to do object tracking and annotation with Prodigy? I know this is likely to involve work on my side, but I don't know what challenges or roadblocks I might face and wanted to ask for advice first. I'm hoping to accomplish something like ulabel or cvat, but with some of the tooling Prodigy provides, since I can use it for image annotation and other labeling work. That said, if it's not a good idea right now, that's good to know too.
I haven't worked on object tracking use cases myself, but just to make sure I understand the requirements correctly: your goal is essentially to annotate a given object / bounding box across the frames of your video, right? Since you'd be working on the individual frames, one option could be to use a workflow like image.manual and stream in the selected frames as images, in order. You can probably do this all programmatically in Python as part of your stream, using something like OpenCV and the video files as the input. In the JSON you send out, you'd then include the base64-encoded image data of the frame and the timestamp, so you'll always be able to match the frames back to the original video.
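Just to illustrate, here's roughly what that stream could look like in Python. It's only a sketch, assuming a local video file and the standard image task format – the function and file names are made up for the example, and the details will depend on your setup:

```python
# Minimal sketch of a frame stream, assuming a local video file and the
# standard image task format; names here are illustrative only.
import base64
import cv2

def stream_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)  # position in the video in ms
        encoded, buf = cv2.imencode(".jpg", frame)     # compress the frame as JPEG
        if encoded:
            data = base64.b64encode(buf.tobytes()).decode("utf-8")
            yield {
                # base64-encoded image data the image UI can render directly
                "image": f"data:image/jpeg;base64,{data}",
                # metadata to match the frame back to the original video
                "meta": {"video": video_path, "frame": frame_idx, "timestamp_ms": timestamp_ms},
            }
        frame_idx += 1
    cap.release()
```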
It'll probably be fine to pick only every n-th frame, depending on the objects and how they move. Potentially, you can even try to be more clever than that and calculate whether a given frame is substantially different from the previous one you sent out (e.g. by comparing the pixels, or maybe something a bit more sophisticated). This way, you can automatically skip frames if nothing happens or moves. (Similarly, a model that predicts "no object" fairly accurately could be quite easy to build and might save you some time as well.)
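As a rough illustration of the frame-skipping idea – the stride and threshold values here are placeholders you'd want to tune on your own footage:

```python
# Rough sketch: combine "every n-th frame" with a cheap pixel-difference check.
# every_n and threshold are placeholders to tune on real footage.
import cv2
import numpy as np

def changed_frames(video_path, every_n=5, threshold=10.0):
    cap = cv2.VideoCapture(video_path)
    prev = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # mean absolute pixel difference to the last frame we kept
            if prev is None or float(np.mean(cv2.absdiff(prev, frame))) > threshold:
                yield idx, frame
                prev = frame
        idx += 1
    cap.release()
```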
It also seems like the most reasonable data point to focus on is the center of the object you're tracking, right? If the center is accurate, a model could likely infer the bounding box from it, so it seems inefficient to try and draw a pixel-perfect box on every frame, or resize and move the box every time. Setting "image_manual_from_center": true lets you draw bounding boxes from the center, which can be a lot faster and might be worth experimenting with.
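If you end up wrapping this in a custom recipe, you could set that option in the recipe's config and plug in the frame stream from above. This is just a hypothetical sketch – the recipe name and arguments are made up, and you could equally set the flag in your prodigy.json:

```python
# Hypothetical custom recipe wrapper, reusing the stream_frames generator from
# the earlier sketch; recipe name and arguments are made up for illustration.
import prodigy

@prodigy.recipe("video-frames.manual")
def video_frames_manual(dataset, video_path, labels):
    return {
        "dataset": dataset,                    # dataset to save annotations to
        "stream": stream_frames(video_path),   # frames streamed in as image tasks
        "view_id": "image_manual",             # manual bounding box interface
        "config": {
            "labels": labels.split(","),
            "image_manual_from_center": True,  # draw boxes from the center outwards
        },
    }
```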
These are just some ideas that came to mind while thinking about the task – there might be other aspects I haven't considered, so let me know if there's anything else (or if you end up implementing an efficient workflow!)
You are correct. My goal is to annotate a given object across frames, which means it can be turned into an image annotation workflow where the images are frames of a video rather than completely independent. Your suggestions make complete sense, and I really appreciate the ideas, links and feedback. It also gives me more confidence that this is achievable, rather than something that only worked in my head, before I invest a lot of time.