Is it possible to use Vertex AI's Multimodal Live API to send images (rather than only text, audio, or video) as input for analysis?
The service flow I’m planning to implement is as follows. I’d also like to get feedback from developers on whether this approach is valid or if there’s a better alternative.
Model intended to be used: Gemini 2.0 Flash
Flow:
- In a supermarket, the user opens a Unity app with the camera on and says into the microphone: "Let me know when you see an apple."
- Once microphone input starts, the app captures a camera frame every second and sends the frames to a Python proxy server over a socket connection.
- The Python server connects to Vertex AI's Multimodal Live API over a WebSocket session and streams both the voice prompt and the captured frames in real time (see the sketch after this list).
- If the model detects an apple in a frame, the Live API returns a response to the server, which forwards the result to the Unity client app as audio (TTS).
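
Here is a rough sketch of what I'm imagining for the proxy side, using the google-genai Python SDK. The project/location values, the model ID, and the exact `send_client_content` / `send_realtime_input` signatures are assumptions on my part (these have changed between SDK versions), so please treat it as pseudocode rather than a verified implementation:

```python
# Minimal sketch of the proxy side, assuming the google-genai SDK.
# Model ID, project/location, and exact method names are assumptions;
# they should be checked against the current SDK documentation.
import asyncio

from google import genai
from google.genai import types

# Placeholder project/region and an assumed Live-capable model ID.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")
MODEL_ID = "gemini-2.0-flash-live-preview-04-09"

# Ask for audio responses so the spoken answer can be forwarded to Unity as-is.
config = types.LiveConnectConfig(response_modalities=["AUDIO"])


async def analyze_frames(frames: list[bytes], prompt: str) -> bytes:
    """Send a text prompt plus JPEG frames to the Live API and collect audio output."""
    audio_out = bytearray()
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Send the user's instruction ("Let me know when you see an apple") as a turn.
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=prompt)])
        )
        # Stream the captured frames as realtime media input, roughly 1 fps.
        for jpeg_bytes in frames:
            await session.send_realtime_input(
                media=types.Blob(data=jpeg_bytes, mime_type="image/jpeg")
            )
            await asyncio.sleep(1.0)

        # Collect the model's audio response; in the real proxy this would be
        # forwarded to the Unity client over the existing socket as it arrives.
        async for message in session.receive():
            if message.data is not None:
                audio_out.extend(message.data)
    return bytes(audio_out)
```

In the actual proxy I would keep one long-lived session open and run the send and receive loops concurrently (e.g. as separate asyncio tasks), pushing each audio chunk back to the Unity client as soon as it arrives rather than buffering a whole turn as shown above.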