Yay, I like that spirit
I haven't worked on object tracking use cases myself, but just to make sure I understand the requirements correctly: your goal is essentially to annotate a given object / bounding box across the frames of your video, right? Since you'd be working on the individual frames, one option could be to use a workflow like image.manual
and stream in the selected frames as images, in order. You can probably do this all programmatically in Python as part of your stream, using something like OpenCV with the video files as input. In the JSON you send out, you'd then include the base64-encoded image data of the frame plus its timestamp, so you'll always be able to match the frames back to the original video.
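Just to make the idea more concrete, here's a rough sketch of what such a stream could look like. The function names (`to_task`, `frame_stream`) and the exact shape of the task dict are my own assumptions, not any particular library's API, so adjust the fields to whatever your workflow expects:

```python
import base64


def to_task(jpeg_bytes, timestamp_ms):
    # Build one task dict: base64-encoded image data plus the timestamp,
    # so each frame can be matched back to the original video later.
    encoded = base64.b64encode(jpeg_bytes).decode("utf-8")
    return {
        "image": f"data:image/jpeg;base64,{encoded}",
        "meta": {"timestamp_ms": timestamp_ms},
    }


def frame_stream(video_path, every_nth=5):
    # OpenCV is imported locally so to_task stays usable without it installed.
    import cv2

    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_nth == 0:
            # Position of the current frame in the video, in milliseconds
            timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
            ok_enc, buf = cv2.imencode(".jpg", frame)
            if ok_enc:
                yield to_task(buf.tobytes(), timestamp_ms)
        index += 1
    cap.release()
```

You'd then feed the generator into your annotation workflow in place of a static list of images.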
It'll probably be fine to only pick every n-th frame, depending on the objects and how they move. Potentially, you can even try to be more clever than that and calculate whether a given frame is substantially different from the previous one you send out (e.g. by comparing the pixels, or maybe something a bit more sophisticated). This way, you can automatically skip frames if nothing happens/moves. (Similarly, having a model predict "no object" pretty accurately could be quite easy and might save you some time as well.)
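The simple pixel-comparison version of that could look something like this, assuming NumPy. The helper name `frame_changed` and the threshold value are made up, and the threshold is definitely something you'd need to tune on your own videos:

```python
import numpy as np


def frame_changed(prev, curr, threshold=10.0):
    # Mean absolute pixel difference between two frames of the same shape.
    # Cast to a signed type first so the subtraction can't wrap around.
    diff = np.mean(np.abs(prev.astype(np.int16) - curr.astype(np.int16)))
    # If the difference is below the threshold, nothing much has moved
    # and the frame can be skipped.
    return diff > threshold
```

If the plain pixel difference turns out to be too noisy (lighting changes, camera shake), downscaling or blurring both frames before comparing usually makes it more robust.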
It also seems like the most reasonable data point to focus on is the center of the object you're tracking, right? If the center is accurate, a model could likely infer the bounding box from it, so it seems inefficient to try and draw a pixel-perfect box on every frame, or to resize and move the box every time. Setting "image_manual_from_center": true
lets you draw bounding boxes from the center, which can be a lot faster and might be worth experimenting with.
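Assuming you're setting this via the usual JSON config file (rather than in a custom recipe), that would just be:

```json
{
  "image_manual_from_center": true
}
```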
These are just some ideas that came to mind thinking about the task – there might be other aspects I haven't considered, so let me know if there's anything else (or if you end up implementing an efficient workflow!)