[Bug] Gemini 3.1 Flash Live — VAD does not process audio for first 10-17 seconds of session

Gaurav_Chand · May 21, 2026, 7:21am

Environment:

Model: gemini-3.1-flash-live-preview
Framework: LiveKit Agents SDK v1.5.10 (Python)
Use case: Outbound voice calls via SIP/LiveKit

Problem:
When a new Gemini Live session starts, the Voice Activity Detector (VAD) does not begin processing incoming audio for approximately 10-17 seconds. During this window, the customer speaks (“hello?”) but Gemini does not detect or respond to any speech. After 10-17 seconds, the VAD activates on its own and Gemini begins speaking — but by then the customer has either hung up or said “hello” multiple times with no response.

Expected behavior:
VAD should begin detecting incoming audio within 1-2 seconds of session establishment, allowing the model to hear and respond to the first utterance.

Actual behavior:

Session connects at T+0
Customer speaks at T+2 (“Hello?”) — Gemini does not detect this
Customer speaks again at T+8 (“Hello?”) — still not detected
At T+10 to T+17, Gemini’s VAD suddenly activates and the model starts speaking unprompted (ignoring its instruction to wait for customer speech)
Calling generate_reply() during the dead window produces the warning “received server content but no active generation”

Workaround attempted:
Immediately after session start, we send 200 ms of noise audio (16 kHz, amplitude 1500), activityStart/activityEnd triggers, and a text instruction via LiveClientRealtimeInput to “wake” the VAD. The session accepts these messages, but the VAD delay is unchanged: the “received server content but no active generation” warning appears and the audio is ignored.
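
A buffer like the one described can be generated as follows. This is a minimal sketch, not the exact code used in the workaround: the sample format (mono, 16-bit signed little-endian PCM) and the use of uniform white noise are assumptions, and make_noise_pcm16 is a hypothetical helper, not part of LiveKit or the Gemini API.

```python
import random
import struct

SAMPLE_RATE_HZ = 16_000   # from the report
DURATION_MS = 200         # from the report
PEAK_AMPLITUDE = 1500     # from the report; out of a 16-bit range of +/-32767


def make_noise_pcm16(duration_ms=DURATION_MS,
                     sample_rate_hz=SAMPLE_RATE_HZ,
                     peak=PEAK_AMPLITUDE,
                     seed=None):
    """Return raw mono PCM16 (little-endian) white noise as bytes.

    Assumption: uniform noise in [-peak, peak]; the report does not say
    which noise distribution was used.
    """
    rng = random.Random(seed)
    n_samples = sample_rate_hz * duration_ms // 1000
    samples = [rng.randint(-peak, peak) for _ in range(n_samples)]
    return struct.pack(f"<{n_samples}h", *samples)


noise = make_noise_pcm16(seed=0)
print(len(noise))  # 3200 samples * 2 bytes each = 6400 bytes
```

At 16 kHz, 200 ms is 3200 samples, so the payload is 6400 bytes. A peak of 1500 is roughly 4.6% of full scale, which may itself fall below a speech-detection threshold.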

Reproduction steps:

Create a Gemini 3.1 Flash Live session with voice enabled
Start streaming audio from a SIP participant immediately
Observe that Gemini does not respond to any speech for 10-17 seconds
After the delay, Gemini begins speaking on its own regardless of instructions

Impact:
This makes Gemini 3.1 Flash Live unusable for real-time voice applications where the model needs to listen and respond to the first utterance (outbound calls, IVR, customer service). Customers hear 10-17 seconds of silence and hang up.

Questions:

Is the VAD warmup delay a known limitation of the preview model? Is there a timeline for a fix?
Is there a recommended way to “prime” the VAD so it begins listening immediately at session start?
Does calling generate_reply() or sending LiveClientRealtimeInput with audio data at session start have any effect on VAD activation, or is the warmup period fixed regardless?
Is there a configuration parameter (e.g., VAD sensitivity, warmup timeout) that we’re missing?
Will this be addressed in the GA release of Gemini 3.1 Flash Live?
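
For context on the configuration question: the Live API documentation describes automatic activity detection settings under realtime_input_config. The sketch below shows that shape as a plain Python dict. The values are illustrative assumptions, it is not known whether any of these fields affect the startup delay on gemini-3.1-flash-live-preview, and how the config is passed through the LiveKit plugin is not shown here.

```python
# Sketch of a Live API session config with explicit VAD settings.
# Values are illustrative assumptions, not recommendations.
live_config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {
            # Keep server-side VAD on; activityStart/activityEnd are
            # intended for the case where this is disabled.
            "disabled": False,
            # Higher start sensitivity should detect speech more readily.
            "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
            "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",
            # Speech duration required before start-of-speech is committed.
            "prefix_padding_ms": 20,
            # Silence duration required before end-of-speech is committed.
            "silence_duration_ms": 500,
        }
    },
}
```

None of these fields is documented as a warmup timeout, so the question of whether the initial delay is configurable at all remains open.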
