Is it possible to use Vertex AI's Multimodal Live API to send images (rather than only text, audio, or video) as input for analysis?
The service flow I’m planning to implement is as follows. I’d also like to get feedback from developers on whether this approach is valid or if there’s a better alternative.
Model intended to be used: Gemini 2.0 Flash
Flow:
- In a supermarket, the user opens a Unity app with the camera on and says into the microphone: "Let me know when you see an apple."
- Once microphone input starts, the app captures a camera frame every second and sends the frames to a Python proxy server over a socket connection.
- The Python server connects to Vertex AI's Multimodal Live API over a WebSocket session and streams both the voice prompt and the captured frames in real time (see the sketch after this list).
- If the model detects an apple in a frame, the Live API returns a response to the server, which forwards the result to the Unity client app as audio (TTS).
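
Here is a rough sketch of what I'm imagining for the proxy side, using the google-genai Python SDK. The project/location values, the model ID, and the exact `send_client_content` / `send_realtime_input` signatures are assumptions on my part (these have changed between SDK versions), so please treat it as pseudocode rather than a verified implementation:

```python
# Minimal sketch of the proxy side, assuming the google-genai SDK.
# Model ID, project/location, and exact method names are assumptions;
# they should be checked against the current SDK documentation.
import asyncio

from google import genai
from google.genai import types

# Placeholder project/region and an assumed Live-capable model ID.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")
MODEL_ID = "gemini-2.0-flash-live-preview-04-09"

# Ask for audio responses so the spoken answer can be forwarded to Unity as-is.
config = types.LiveConnectConfig(response_modalities=["AUDIO"])


async def analyze_frames(frames: list[bytes], prompt: str) -> bytes:
    """Send a text prompt plus JPEG frames to the Live API and collect audio output."""
    audio_out = bytearray()
    async with client.aio.live.connect(model=MODEL_ID, config=config) as session:
        # Send the user's instruction ("Let me know when you see an apple") as a turn.
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=prompt)])
        )
        # Stream the captured frames as realtime media input, roughly 1 fps.
        for jpeg_bytes in frames:
            await session.send_realtime_input(
                media=types.Blob(data=jpeg_bytes, mime_type="image/jpeg")
            )
            await asyncio.sleep(1.0)

        # Collect the model's audio response; in the real proxy this would be
        # forwarded to the Unity client over the existing socket as it arrives.
        async for message in session.receive():
            if message.data is not None:
                audio_out.extend(message.data)
    return bytes(audio_out)
```

In the actual proxy I would keep one long-lived session open and run the send and receive loops concurrently (e.g. as separate asyncio tasks), pushing each audio chunk back to the Unity client as soon as it arrives rather than buffering a whole turn as shown above.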