There isn’t a real difference. The “Explore vision capabilities with the Gemini API” document was written significantly later than the “Text generation” document (as the models evolved), so the more recent one (“Explore…”) goes into more technical depth and covers features first introduced with the Gemini 2.0 family, such as bounding boxes.
The underlying API is the same; it’s just that, historically, text-only models appeared first, followed by multimodal models, and then multimodal models with improved functionality.
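To make that concrete, here’s a minimal sketch using the Python SDK (`google-generativeai`): the same `generate_content` method handles both cases, and a “vision” request is just a request whose contents include an image part. The model name and file path are placeholders.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Text-only prompt -- the kind of call the "Text generation" doc shows.
text_response = model.generate_content("Write a haiku about the ocean.")
print(text_response.text)

# The exact same method accepts multimodal input -- the "Explore vision
# capabilities" doc is just passing more parts in the contents list.
image = Image.open("photo.jpg")
vision_response = model.generate_content([image, "Describe this image."])
print(vision_response.text)
```

So the two docs describe the same endpoint; the newer one simply demonstrates what the newer models can do with it.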