Gemini-2.5-flash-native-audio-preview with manual VAD (disabled: True) - Gemini never responds after ActivityEnd, session dies with 1011 keepalive ping timeout

Model: gemini-2.5-flash-native-audio-preview-12-2025
SDK: Python google-genai, WebRTC audio bridge via aiortc
VAD mode: Manual - automatic_activity_detection: { disabled: True }
Thinking: thinking_budget: 0
Tools: Yes - 4 custom function declarations

Setup

We’re building an AI interview platform over WebRTC. The audio pipeline is:

  • Browser mic → aiortc WebRTC track → resampled 48kHz→16kHz → send_realtime_input(audio=...)

  • We disabled auto-VAD and implemented our own amplitude-based VAD on the server

  • On speech start: send_realtime_input(activity_start=ActivityStart())

  • On speech end: send_realtime_input(activity_end=ActivityEnd())
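For context, here is a minimal sketch of the kind of amplitude-based VAD described above. It is not our production code: the class name AmplitudeVad, the RMS threshold, and the frame counts are all illustrative assumptions. It only shows the state machine that decides when ActivityStart and ActivityEnd would be sent.

```python
import math
import struct
from typing import Optional


def rms_int16(pcm: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class AmplitudeVad:
    """Hypothetical two-state VAD: feed() returns "start", "end", or None."""

    def __init__(self, threshold: float = 500.0, start_frames: int = 3, end_frames: int = 25):
        self.threshold = threshold        # assumed RMS cutoff, tune per microphone
        self.start_frames = start_frames  # consecutive loud frames needed to open a turn
        self.end_frames = end_frames      # consecutive quiet frames needed to close it
        self.speaking = False
        self._loud = 0
        self._quiet = 0

    def feed(self, pcm: bytes) -> Optional[str]:
        if rms_int16(pcm) >= self.threshold:
            self._loud += 1
            self._quiet = 0
        else:
            self._quiet += 1
            self._loud = 0
        if not self.speaking and self._loud >= self.start_frames:
            self.speaking = True
            return "start"   # caller would send ActivityStart here
        if self.speaking and self._quiet >= self.end_frames:
            self.speaking = False
            return "end"     # caller would send ActivityEnd here
        return None
```

Note that the "end" event can only fire if feed() keeps being called with quiet frames, which is exactly what breaks when the track stops delivering audio.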

The Gemini session opens, the agent greets the candidate (turn_complete fires correctly), and then the candidate speaks. After this the session consistently dies with:

websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received

Two failure modes we’ve observed

Mode A: VAD fires speech END, ActivityEnd is sent, but Gemini never responds. Audio chunks stop, no turn_complete ever arrives, 1011 after ~10–12 seconds.

Mode B (more common): VAD speech START fires and audio streams for 30–60 seconds, but VAD speech END never fires. When the candidate stops talking, track.recv() times out (no frames arrive) rather than delivering silence, so our VAD silence counter never accumulates enough to trigger ActivityEnd. Gemini sits waiting, and the 1011 arrives ~10–12 seconds after the last audio chunk.
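One way to keep the silence counter moving in Mode B is to treat a receive timeout as a frame of silence. The sketch below is an assumption-laden illustration, not our actual bridge code: recv_or_silence, the 100 ms timeout, and the 20 ms / 16 kHz frame size are hypothetical, and recv stands in for whatever coroutine yields the next resampled PCM frame.

```python
import asyncio
from typing import Awaitable, Callable

SAMPLE_RATE = 16_000  # Hz, after resampling
FRAME_MS = 20         # assumed frame duration
SILENCE = bytes(SAMPLE_RATE * FRAME_MS // 1000 * 2)  # 16-bit mono zeros


async def recv_or_silence(
    recv: Callable[[], Awaitable[bytes]], timeout: float = 0.1
) -> bytes:
    """Return the next PCM frame, or a zero-filled frame if none arrives in time."""
    try:
        return await asyncio.wait_for(recv(), timeout)
    except asyncio.TimeoutError:
        # No frame arrived: hand the VAD silence so its counter keeps advancing.
        return SILENCE
```

A caveat with this approach: asyncio.wait_for cancels the pending recv() call on timeout, so whether that is safe depends on how the underlying track handles cancellation.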


Log excerpt (Mode B - no speech END)

14:05:43 [GeminiLive] turn_complete — agent greeted candidate
14:05:44 [Bridge] VAD unlocked — mic ready
14:05:45 [Bridge] VAD: speech START
14:05:45 [Connection] Candidate activity START → send_activity_start() called
... 1900 chunks sent over ~38 seconds ...
14:06:23 last audio chunk logged
(11 seconds of nothing)
14:06:34 [Agent] ERROR: sent 1011 (internal error) keepalive ping timeout

No "VAD: speech END" line is ever logged, and no ActivityEnd is ever sent. Gemini waits forever.


What we’ve tried / discovered

  1. Removed send_audio_stream_end() after ActivityEnd - we had been calling it as a “flush” after ActivityEnd, based on a community suggestion. Per the Vertex AI reference docs, “An AudioStreamEnd isn’t sent in this configuration. Instead, any interruption of the stream is marked by an ActivityEnd message.” Removing it didn’t fully resolve the issue.

  2. Draining the inbound queue before ActivityEnd - implemented a queue drain before sending ActivityEnd to prevent post-boundary audio from reaching Gemini.

  3. Track-timeout silence counting - modified our VAD to count WebRTC track timeouts as silence accumulation, so speech END fires even when track.recv() stops delivering frames. This partially helps Mode B.

  4. _activity_ended flag on send_audio - blocks the audio send loop from sending chunks after ActivityEnd.

I would appreciate any guidance on the following:

  1. Is ActivityEnd guaranteed to trigger model inference, or are there conditions under which Gemini ignores it? We’re seeing cases where it’s sent cleanly (queue drained first, no post-boundary audio) and Gemini still doesn’t respond.
  2. Does audio arriving in the ~50–200ms after ActivityEnd corrupt the turn boundary? The genai SDK serialises messages but our audio loop runs concurrently — is there a race at the SDK’s websocket layer?
  3. Is there a known issue with gemini-2.5-flash-native-audio-preview-12-2025 and manual VAD mode?
  4. What is the recommended pattern for guaranteeing ActivityEnd is the last message Gemini receives before inference? The SDK’s send_realtime_input for audio and for activity signals appear to go through different code paths - is there a flush/sync mechanism?
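To make question 4 concrete, this is the kind of client-side ordering we could enforce ourselves: route every outbound message through one queue with a single writer task, so ActivityEnd cannot be overtaken by a concurrent audio send. This is only a sketch under assumptions. OrderedSender and its method names are hypothetical, send stands in for a coroutine such as the session's send_realtime_input, and we don't know whether this addresses whatever happens on the server side.

```python
import asyncio
from typing import Any, Awaitable, Callable

Send = Callable[..., Awaitable[None]]


class OrderedSender:
    """Hypothetical single-writer wrapper: one task performs every send, FIFO."""

    def __init__(self, send: Send):
        self._send = send
        self._queue: asyncio.Queue = asyncio.Queue()
        self._turn_open = False

    def activity_start(self, marker: Any) -> None:
        self._turn_open = True
        self._queue.put_nowait({"activity_start": marker})

    def audio(self, chunk: Any) -> None:
        # Audio outside an open turn is dropped rather than trailing ActivityEnd.
        if self._turn_open:
            self._queue.put_nowait({"audio": chunk})

    def activity_end(self, marker: Any) -> None:
        if self._turn_open:
            self._turn_open = False
            self._queue.put_nowait({"activity_end": marker})

    def close(self) -> None:
        self._queue.put_nowait(None)  # sentinel that stops run()

    async def run(self) -> None:
        # The only place that awaits the real send, so wire order == queue order.
        while True:
            kwargs = await self._queue.get()
            if kwargs is None:
                return
            await self._send(**kwargs)
```

With this shape, the VAD callbacks and the audio loop only enqueue; they never touch the websocket directly, so the _activity_ended flag and the queue drain collapse into the single _turn_open check.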

Hi @olaniyi_george, can you DM me your project number?

Hi @Mustan_lokhand, I have sent you our project number by DM. Do you have any insights you can share now that we can work with while a fix for this bug is in progress?