LLM for object detection in images


The OpenAI API now accepts image URLs in chat completions. GPT-4 can load the URL and describe what is in the image. It can also be prompted to look for specific objects and return the output in a particular format.

OpenAI will return a true or false flag indicating the presence of that object. These flags will then be used to pre-fill the image annotations.
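For context, a minimal sketch of what that call could look like with the OpenAI Python client (the model name, prompt wording, and the parse_flag helper are illustrative, not something Prodigy provides):

```python
def parse_flag(reply: str) -> bool:
    """Map the model's free-text answer to a boolean presence flag."""
    return reply.strip().lower().startswith("true")


def detect_object(image_url: str, label: str) -> bool:
    """Ask GPT-4 whether `label` appears in the image at `image_url`.

    Requires `pip install openai` and an OPENAI_API_KEY environment
    variable; the model name below is illustrative.
    """
    from openai import OpenAI  # imported here so the module loads without the SDK

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Is there a {label} in this image? Answer only 'true' or 'false'."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return parse_flag(resp.choices[0].message.content)
```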

Is this something that's possible to do with the current Prodigy LLM implementation?

Secondly, we would like to treat OpenAI as an annotator and validate its predictions in review mode. In other words, we could generate predictions for all images with OpenAI and write them directly to the examples table under an openai-1234 session ID.

Can we do this directly in Prodigy?
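For what it's worth, a rough sketch of that second idea, assuming Prodigy's database API (connect, add_dataset, add_examples, and set_hashes are real Prodigy functions; the task fields and the gpt4-image-flags dataset name here are illustrative):

```python
OPENAI_SESSION = "openai-1234"  # session id from the question above


def make_tasks(predictions):
    """Turn (image_url, label, present) tuples into Prodigy-style task dicts."""
    for url, label, present in predictions:
        yield {
            "image": url,
            "label": label,
            "answer": "accept" if present else "reject",
            "_session_id": OPENAI_SESSION,  # marks OpenAI as the annotator
        }


if __name__ == "__main__":
    # Requires Prodigy installed; writes to your configured database.
    from prodigy import set_hashes
    from prodigy.components.db import connect

    predictions = [("https://example.com/img1.jpg", "forklift", True)]
    tasks = [set_hashes(t) for t in make_tasks(predictions)]

    db = connect()
    db.add_dataset("gpt4-image-flags")
    db.add_examples(tasks, datasets=["gpt4-image-flags"])
```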

Hi @SiteAssist,

Thanks for your question.

We don't currently have that functionality in our built-in recipes for handling images, as it's such a new feature. I've added an internal ticket to look into it.

Your best option is probably to develop a custom recipe, perhaps extending the existing built-in ones. If you weren't aware, you can view the existing recipes within your Prodigy install: find where the package is installed (run prodigy stats and check the Location: field), then look for the recipes/llm folder.

We do have something very similar with our model-as-annotator recipes:

But that is for text tasks like ner, textcat, and spancat. The review interface doesn't work with images or audio as mentioned in the review docs:

In particular, the image_manual and audio_manual interfaces aren’t supported because the very nature of the UI makes it hard to combine annotations. These interfaces allow users to draw shapes and these may differ due to small differences in pixel values. That doesn’t allow for a great review experience which is why these aren’t supported.

You could likely still do something creative for a review-like workflow using custom interfaces combined with blocks (e.g., adding a text_input block to correct the text description). This would be similar to the whisper plugin, except that instead of the audio interface you'd use "html" (e.g., if you only want to show the image) while correcting GPT-4's image description.

Hope this helps!