Gemini Live API (Vertex AI Beta, March 2026) — One Developer’s Lessons Building Voice Apps
I’ve been prototyping real-time voice apps on the Gemini Live API (Preview SDK and Vertex AI GA) for several months — a voice-driven browser extension, a media co-pilot, and a collection of multiplayer games. Along the way I’ve logged 55+ E2E test runs and catalogued every failure mode, workaround, and pattern that actually works.
I wrote it all up in a comprehensive guide: Gemini Live API (Vertex AI Beta, March 2026) — One Developer’s Lessons Building Voice Apps
Below is a summary of the highest-impact findings. The full guide has code examples, architecture diagrams, and a migration checklist.
The Big One: Fire-and-Forget Tools
If you’re using tools with Gemini Live on Vertex AI GA, you’ve probably noticed that every tool response triggers the model to narrate the result — “I’ve updated the score…”, “Let me log that…” — even when you don’t want it to.
This is because `FunctionResponseScheduling.SILENT` is silently stripped by the Vertex protobuf layer. The model never receives the scheduling hint.
The fix: Don’t send tool responses back at all. Execute tools client-side and discard the result from the model’s perspective. For tools that carry state the model needs (round counts, etc.), buffer the results and inject them as clientContent after the model finishes speaking.
But fire-and-forget alone isn’t enough. The model still occasionally narrates tool calls even without receiving a response. You need a four-layer defense:
1. "SILENT EXECUTION." in every tool description string
2. Explicit SI rules: “Say nothing after tool calls. Call the tool and stop.”
3. Fire-and-forget (no `sendToolResponse()`)
4. Client-side audio gating — drop audio chunks after tool calls until turn completes
In testing, layers 1-3 together still failed 67% of the time. Adding layer 4 brought it to 0% failure across 10 consecutive test runs.
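Layer 4 can be as small as a boolean gate in the audio playback path. A minimal sketch, assuming your player receives model audio one chunk at a time (the `AudioGate` name is mine, not the SDK’s):

```typescript
// Client-side audio gating: drop model audio from the moment a tool call
// arrives until the server signals the turn is complete.
class AudioGate {
  private gated = false;

  onToolCall(): void {
    this.gated = true; // start dropping narration as soon as a tool fires
  }

  onTurnComplete(): void {
    this.gated = false; // resume playback on the next turn
  }

  // Returns the chunk if playable, or null if it should be dropped.
  filter(chunk: ArrayBuffer): ArrayBuffer | null {
    return this.gated ? null : chunk;
  }
}
```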
Context Injection is Dangerous
Any `sendClientContent` with `turnComplete: true` while the model is speaking interrupts it. The model stops mid-sentence and starts a new response.
This means you can’t inject state mid-turn. You need to buffer context messages and flush them only after the model’s turn is fully complete — which requires tracking both the server’s `turnComplete` signal AND the last audio chunk finishing playback. Neither signal alone is reliable.
The full guide has the implementation pattern with code.
Session Architecture: The “Forgetful” vs. “Severed” Dilemma
Every voice app has structured data the model can’t forget (page content, video metadata, product catalog). When that data changes mid-conversation, you’re stuck:
- Keep the socket open and inject new data as chat → it slides out of the context window within 10-15 minutes
- Disconnect and reconnect with a fresh system instruction → ~500ms-1s dead air, VAD reset, audio glitches
We use managed session cycling (Approach B, refined): flush transcript to a session log, close the session (but keep mic alive), rebuild the SI with history + new data, reconnect. The model role-plays continuity via the injected history. It works surprisingly well but is architecturally expensive — every app has to build its own session lifecycle manager.
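The cycle itself is short; most of the cost is in the lifecycle plumbing around it. A sketch under stated assumptions — `connect`, `Session`, and the SI template below are stand-ins, not the SDK’s actual API:

```typescript
// Managed session cycling (sketch): close the socket, rebuild the system
// instruction from the transcript log plus fresh data, reconnect.
interface Session {
  close(): Promise<void>;
}

type Connect = (systemInstruction: string) => Promise<Session>;

async function cycleSession(
  current: Session,
  connect: Connect,
  transcript: string[], // running session log, kept client-side
  freshData: string,    // new structured data (page content, catalog, ...)
): Promise<Session> {
  await current.close(); // mic capture stays alive; only the socket drops
  const si = [
    "You are resuming an ongoing conversation. Prior transcript:",
    ...transcript,
    "Current data:",
    freshData,
  ].join("\n");
  return connect(si); // the model role-plays continuity via the injected history
}
```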
Known Model Behavioral Rates
From 40+ E2E test runs across three apps:
| Behavior | Rate | Notes |
|---|---|---|
| MALFORMED_FUNCTION_CALL | ~30% of sessions | Model produces invalid JSON. Recovers with text nudge. |
| Duplicate tool calls | ~20% of sessions | Same tool fired twice for same event. Need client-side dedup. |
| Tool narration (without audio gating) | ~40% of tool calls | “Let me update that…” despite SILENT instructions |
| Auto-answering instead of waiting | ~50% with topic context | Model fills in user responses instead of waiting for input |
All of these require client-side mitigation. The full guide covers each one with code.
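For the duplicate-tool-call row above, the mitigation can be a small debounce keyed on the call’s name and arguments. This `ToolCallDeduper` is an illustrative sketch, not from the guide verbatim; the 2-second window is an assumption to tune per app:

```typescript
// Client-side dedup: drop a tool call if an identical (name, args) pair
// fired within a short window. Note the window is debounce-style: a dropped
// duplicate refreshes the timestamp, extending the suppression.
class ToolCallDeduper {
  private lastSeen = new Map<string, number>();

  constructor(private windowMs = 2000, private now: () => number = Date.now) {}

  shouldExecute(name: string, args: Record<string, unknown>): boolean {
    const key = name + ":" + JSON.stringify(args);
    const t = this.now();
    const prev = this.lastSeen.get(key);
    this.lastSeen.set(key, t);
    return prev === undefined || t - prev > this.windowMs;
  }
}
```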
System Instruction Design Tips
- Put everything in the SI before connecting. Any `sendText` mid-session acts as a user turn and interrupts the model.
- Use numbered phases (“STEP 1: SETUP”, “STEP 2: GAMEPLAY”). The model transitions naturally.
- Be extremely prescriptive about tool flow. The model can’t see tool results (fire-and-forget), so spell out exact steps (a, b, c, d).
- Pin the language. The native audio model code-switches on accent or background noise. Add: “You MUST speak only in English.”
- Single source of truth. If you inject a follow-up via context, remove that instruction from the SI. Otherwise the model does it twice.
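Put together, an SI skeleton applying these tips might look like the following. The game framing, phase labels, and tool names (`set_player_name`, `record_answer`) are hypothetical examples, not a prescribed template:

```typescript
// Illustrative system instruction: numbered phases, prescriptive per-step
// tool flow, silence rules after tool calls, and a language pin.
const systemInstruction = `
You are the host of a trivia game.

STEP 1: SETUP
a. Greet the player and ask for their name.
b. Call set_player_name. SILENT EXECUTION: say nothing after the call.
c. Wait for the player to speak before continuing.

STEP 2: GAMEPLAY
a. Ask one question, then stop and wait for the answer.
b. Call record_answer. Say nothing after tool calls. Call the tool and stop.

You MUST speak only in English.
`.trim();
```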
Testing Voice Apps
We developed three testing tiers:
| Tier | Method | Speed |
|---|---|---|
| Text injection | `sendText()` via React fiber traversal | ~5 min/run |
| Voice via virtual audio | TTS → BlackHole → Chrome mic | ~10 min/run |
| Manual | Talk to it yourself | ~20 min/run |
Key discipline: N=3 minimum for any behavioral claim. Report raw counts (“2/3 passed”), not impressions.
The full guide is here: Gemini Live API (Vertex AI Beta, March 2026) — One Developer’s Lessons Building Voice Apps
It covers all of the above in detail plus: the WebSocket proxy architecture, the setup message reference, turn completion mechanics, AI speaking state debounce, session resumption, voice-driven UI navigation, and a complete migration checklist from Preview SDK to Vertex AI GA.
Happy to answer questions or compare notes with anyone else building on Live. What patterns have you found?