Has Anyone Gained Access to Gemini 1.5 Pro API? (Re: Gemini 1.5 Pro API's Multimodal Features)

According to the Gemini 1.5 Pro technical report, the model is designed to be multimodal. I take this to mean it can process both video and audio inputs natively (at minimum, videos of 10+ minutes, without first extracting text from the video, etc.).

Does this iteration continue the practice of employing distinct models for different modalities, the way Gemini Pro Vision was separate from Gemini Pro? I’m eager to find out whether the Gemini API consolidates these capabilities, allowing a single, versatile way to handle text, audio, and video inputs.


I have been using the 1.5 Pro API since yesterday, and it seems your prediction is right. There are no distinct modules: the model is natively multimodal, so one combined model handles text, audio, and video inputs.
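For what it's worth, here is a minimal sketch of what that combined handling looks like, assuming the google-generativeai Python SDK; the API key and the local file `lecture.mp4` are placeholders:

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# One model object handles text, image, audio, and video parts alike.
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Upload the video via the File API ("lecture.mp4" is a placeholder path).
video_file = genai.upload_file(path="lecture.mp4")

# Video files are processed asynchronously; poll until the file is ACTIVE.
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

# Mix the video and a text prompt in a single generate_content call.
response = model.generate_content(
    [video_file, "Summarize the key points of this video."]
)
print(response.text)
```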


Yes, it seems that Gemini 1.5 combines Gemini Pro and Gemini Pro Vision into a single model.


Late to the party, but you can go ahead and use Gemini 1.5 Pro in Google AI Studio and in the Gemini API.

Yes, it’s a multimodal model, which means it’s a single model that handles text, code, images, audio, and video.
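To illustrate, audio goes through the same single model; a quick sketch, again assuming the google-generativeai Python SDK and a placeholder file `interview.mp3`:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Upload an audio file (placeholder path); long files may need the
# same PROCESSING/ACTIVE polling shown in the video example above.
audio_file = genai.upload_file(path="interview.mp3")

# The same generate_content call accepts audio plus a text prompt.
response = model.generate_content(
    [audio_file, "Transcribe the first minute of this recording."]
)
print(response.text)
```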

You can find examples here:
