Text generation or vision understanding to describe an image?

Hello,

This is the doc for Google Gemini vision understanding: @Explore vision capabilities with the Gemini API  |  Google AI for Developers
and this is the doc for Google Gemini text generation: @Text generation  |  Gemini API  |  Google AI for Developers

In both cases there is a feature to input an image and ask a question about it. What is the difference between the two approaches?

Welcome to the forum.

There isn’t a real difference. The “Explore vision capabilities with the Gemini API” document was created significantly later than the “Text generation” document (as the models evolved), so the more recent documentation (“Explore…”) goes into more technical depth and shows features that were first introduced in the Gemini 2.0 family, such as bounding boxes.

The underlying API is the same; it’s just that, historically speaking, text-only models showed up first, followed by multimodal models, and then multimodal models with improved functionality.
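To make the “same API” point concrete, here is a minimal sketch, assuming the Python google-genai SDK (the API key, model name, and image path are placeholders, not taken from the docs above): the same generate_content call covers both the text-only case from the “Text generation” page and the image-plus-question case from the vision page.

```python
# Minimal sketch assuming the google-genai SDK; placeholders for key/model/path.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")

# Text-only prompt, as in the "Text generation" doc.
text_only = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="Write a one-line caption for a photo of a lighthouse.",
)
print(text_only.text)

# Image plus question, as in the vision doc -- same method, same endpoint.
image = Image.open("photo.jpg")
multimodal = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[image, "Describe this image in one sentence."],
)
print(multimodal.text)
```

Either way it is a single generateContent request; the vision doc just layers image-specific guidance (and newer features like bounding boxes) on top of it.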

Hope that helps.