According to the Gemini 1.5 Pro Technical Report, the model is designed to be natively multimodal, which I take to mean it can process both video and audio inputs directly (at minimum, videos of 10+ minutes, without first extracting text from the video, etc.).
Does this iteration continue the practice of using distinct models for vision and audio, such as Gemini Vision and Gemini Audio, respectively? I'm eager to find out whether the Gemini API instead consolidates these capabilities, so a single endpoint can handle text, audio, and video inputs.
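For context, this is the kind of single consolidated call I'm hoping is possible. A rough sketch of building one mixed-part request body; the `contents`/`parts`/`inline_data` shape is my assumption based on the public Gemini REST docs, and whether 1.5 Pro accepts long video this way is exactly what I'm asking:

```python
import base64


def build_multimodal_request(prompt, media_path, mime_type):
    """Build one generateContent-style request mixing text with raw media.

    The {"contents": [{"parts": [...]}]} shape follows the public Gemini
    REST documentation; field names here are an assumption on my part,
    not confirmed for 1.5 Pro specifically.
    """
    with open(media_path, "rb") as f:
        # Inline media is base64-encoded alongside the text part.
        media_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {"inline_data": {"mime_type": mime_type, "data": media_b64}},
                ]
            }
        ]
    }
```

If the API really is unified, I'd expect to POST a body like this (with, say, `mime_type="video/mp4"`) to a single model endpoint rather than routing vision and audio separately.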