[Bug] Gemini 3.1 Flash Live — VAD does not process audio for first 10-17 seconds of session

Environment:

  • Model: gemini-3.1-flash-live-preview

  • Framework: LiveKit Agents SDK v1.5.10 (Python)

  • Use case: Outbound voice calls via SIP/LiveKit

Problem:
When a new Gemini Live session starts, voice activity detection (VAD) does not begin processing incoming audio for approximately 10-17 seconds. During this window the customer speaks (“hello?”), but Gemini neither detects nor responds to the speech. After the delay, the VAD activates on its own and Gemini begins speaking, but by then the customer has either hung up or said “hello” several times with no response.

Expected behavior:
VAD should begin detecting incoming audio within 1-2 seconds of session establishment, allowing the model to hear and respond to the first utterance.

Actual behavior:

  • Session connects at T+0

  • Customer speaks at T+2 (“Hello?”) — Gemini does not detect this

  • Customer speaks again at T+8 (“Hello?”) — still not detected

  • At T+10 to T+17, Gemini’s VAD suddenly activates and the model starts speaking unprompted (ignoring its instruction to wait for customer speech)

  • Calling generate_reply() during the dead window produces a “received server content but no active generation” warning
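
To put exact numbers on the timeline above, each event can be timestamped relative to session start. The following is a minimal sketch using only the standard library. `SessionTimeline` and the labels `user_speech` and `model_audio` are hypothetical names chosen for illustration; they are not part of the LiveKit or Gemini APIs, and the calls to `mark()` would need to be wired into whatever callbacks your agent already has.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class SessionTimeline:
    """Records labelled events as offsets (seconds) from session start (T+0)."""

    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _t0: Optional[float] = None
    events: List[Tuple[float, str]] = field(default_factory=list)

    def start(self) -> None:
        """Mark T+0, e.g. when the Live session reports it is connected."""
        self._t0 = self.clock()
        self.events.clear()

    def mark(self, label: str) -> float:
        """Record an event and return its offset from T+0."""
        if self._t0 is None:
            raise RuntimeError("call start() before mark()")
        offset = self.clock() - self._t0
        self.events.append((offset, label))
        return offset

    def first(self, label: str) -> Optional[float]:
        """Offset of the first event with this label, or None if absent."""
        for offset, name in self.events:
            if name == label:
                return offset
        return None

    def dead_window(self) -> Optional[float]:
        """Seconds between the first user speech and the first model audio."""
        heard = self.first("user_speech")
        replied = self.first("model_audio")
        if heard is None or replied is None:
            return None
        return replied - heard
```

Calling `start()` on connect, `mark("user_speech")` when the caller's audio is first known to contain speech, and `mark("model_audio")` on the first model audio frame gives a per-call `dead_window()` value that can be attached to a bug report.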

Workaround attempted:
We send noise audio (200 ms, 16 kHz, amplitude 1500), activityStart/activityEnd triggers, and a text instruction via LiveClientRealtimeInput immediately after session start to “wake” the VAD. The session accepts this input, but the VAD delay remains: the “received server content but no active generation” warning appears and the audio is ignored.
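
For anyone trying to reproduce the workaround, the noise burst can be generated roughly as follows. This is a sketch under assumptions: the report only states 200 ms, 16 kHz, and amplitude 1500, so the 16-bit signed little-endian mono PCM format and the uniform white-noise distribution are guesses, and `make_noise_pcm16` is a hypothetical helper name. Sending the bytes to the session is left out because it depends on the client library in use.

```python
import random
import struct
from typing import Optional

SAMPLE_RATE_HZ = 16_000  # from the report
DURATION_MS = 200        # from the report
AMPLITUDE = 1500         # from the report; peak value on the int16 scale (assumed)


def make_noise_pcm16(
    duration_ms: int = DURATION_MS,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    amplitude: int = AMPLITUDE,
    seed: Optional[int] = None,
) -> bytes:
    """Return uniform white noise as 16-bit signed little-endian mono PCM."""
    rng = random.Random(seed)
    n_samples = sample_rate_hz * duration_ms // 1000
    samples = [rng.randint(-amplitude, amplitude) for _ in range(n_samples)]
    # "<" = little-endian, "h" = signed 16-bit integer
    return struct.pack(f"<{n_samples}h", *samples)


noise = make_noise_pcm16(seed=0)
```

With the defaults this yields 3,200 samples (6,400 bytes), which is the payload that would be sent as realtime audio input right after the session opens.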

Reproduction steps:

  1. Create a Gemini 3.1 Flash Live session with voice enabled

  2. Start streaming audio from a SIP participant immediately

  3. Observe that Gemini does not respond to any speech for 10-17 seconds

  4. After the delay, Gemini begins speaking on its own regardless of instructions

Impact:
This makes Gemini 3.1 Flash Live unusable for real-time voice applications in which the model must listen and respond to the first utterance (outbound calls, IVR, customer service). Customers hear 10-17 seconds of silence and hang up.

Questions:

  1. Is the VAD warmup delay a known limitation of the preview model? Is there a timeline for a fix?

  2. Is there a recommended way to “prime” the VAD so it begins listening immediately at session start?

  3. Does calling generate_reply() or sending LiveClientRealtimeInput with audio data at session start have any effect on VAD activation, or is the warmup period fixed regardless?

  4. Is there a configuration parameter (e.g., VAD sensitivity, warmup timeout) that we’re missing?

  5. Will this be addressed in the GA release of Gemini 3.1 Flash Live?
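
As context for question 4, the Live API setup message exposes automatic activity detection settings under realtimeInputConfig. The fragment below is a sketch of that part of the setup payload as a plain Python dict. The field names follow the Live API reference as best understood here, the values are illustrative assumptions rather than the settings used in this report, and whether gemini-3.1-flash-live-preview honors them is not known. None of the listed fields is a warmup timeout.

```python
import json

# Illustrative values only; not the configuration used in this report.
vad_setup_fragment = {
    "realtimeInputConfig": {
        "automaticActivityDetection": {
            "disabled": False,  # keep server-side VAD enabled
            "startOfSpeechSensitivity": "START_SENSITIVITY_HIGH",
            "endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
            "prefixPaddingMs": 20,     # speech required before start-of-speech commits
            "silenceDurationMs": 500,  # silence required before end-of-speech commits
        }
    }
}

print(json.dumps(vad_setup_fragment, indent=2))
```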