Gemini Live (manual VAD, WebRTC): 1011 keepalive timeout after ActivityEnd — is inference guaranteed to trigger?

Stack: gemini-2.5-flash-native-audio-preview-12-2025 · Python google-genai SDK · WebRTC audio bridge via aiortc · Manual VAD (automatic_activity_detection: disabled) · thinking_budget: 0 · 4 custom tool declarations

What we’re building

We’re developing a real-time AI interview platform that routes candidate audio through a WebRTC pipeline: browser microphone → aiortc WebRTC track → server-side resampling from 48kHz down to 16kHz → send_realtime_input(audio=...) into a live Gemini session. Because we need precise control over turn boundaries, we disabled Gemini’s built-in VAD entirely and implemented our own amplitude-based voice activity detection on the server side. When our VAD detects speech onset, we fire send_realtime_input(activity_start=ActivityStart()); when it detects silence long enough to call an utterance complete, we fire send_realtime_input(activity_end=ActivityEnd()).
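For concreteness, our boundary logic reduces to a small state machine. Here is a minimal sketch of it — the threshold and hangover values are illustrative, not our production numbers, and the two callbacks stand in for the send_realtime_input(activity_start=...) / send_realtime_input(activity_end=...) calls:

```python
SPEECH_RMS = 500.0       # assumed threshold: frame RMS above this counts as speech
HANGOVER_FRAMES = 40     # assumed hangover: ~800 ms of 20 ms silent frames ends the turn

class AmplitudeVAD:
    """Amplitude-based VAD: fires on_start at speech onset and on_end once
    enough consecutive silent frames have accumulated."""

    def __init__(self, on_start, on_end):
        self.on_start = on_start   # → send_realtime_input(activity_start=ActivityStart())
        self.on_end = on_end       # → send_realtime_input(activity_end=ActivityEnd())
        self.in_speech = False
        self.silence = 0

    def feed(self, rms: float) -> None:
        """Consume one audio frame's RMS level."""
        if rms >= SPEECH_RMS:
            if not self.in_speech:
                self.in_speech = True
                self.on_start()
            self.silence = 0
        elif self.in_speech:
            self.silence += 1
            if self.silence >= HANGOVER_FRAMES:
                self.in_speech = False
                self.silence = 0
                self.on_end()
```

The key property (and, as described below, the key weakness) is that the silence counter only advances when a frame actually arrives.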

The Gemini session initialises correctly — the agent delivers its opening greeting and turn_complete fires as expected. The problem surfaces the moment the candidate starts speaking. Without exception, the session terminates with:

websockets.exceptions.ConnectionClosedError: sent 1011 (internal error)
keepalive ping timeout; no close frame received

Two distinct failure patterns

Failure A — ActivityEnd is sent but Gemini never responds. Audio delivery stops, no turn_complete event arrives, and the connection dies with a 1011 error roughly 10–12 seconds later. The signal reached Gemini, but inference never triggered.

Failure B (dominant) — Our VAD detects speech onset and begins streaming audio. The candidate speaks for anywhere between 30 and 60 seconds, then stops — but our VAD never accumulates enough silence to emit ActivityEnd. The root cause: once the candidate goes quiet, track.recv() on the WebRTC side stops returning frames altogether rather than delivering silent frames. With no incoming audio, our silence counter never advances, ActivityEnd never gets sent, and Gemini is left holding an open turn indefinitely. The 1011 follows ~10–12 seconds later.

Representative log for Failure B:

14:05:43 [GeminiLive] turn_complete — agent greeted candidate
14:05:44 [Bridge] VAD unlocked — mic ready
14:05:45 [Bridge] VAD: speech START
14:05:45 [Connection] Candidate activity START → send_activity_start() called
... 1900 audio chunks sent over ~38 seconds ...
14:06:23 last audio chunk logged
--- 11 seconds of silence ---
14:06:34 [Agent] ERROR: sent 1011 (internal error) keepalive ping timeout

No "VAD: speech END" line ever appears in the log, and ActivityEnd is never dispatched.


What we’ve already tried

  1. Removed the send_audio_stream_end() call — we’d been calling it as a flush signal after ActivityEnd, based on a community tip. The Vertex AI docs clarify that in manual VAD mode, AudioStreamEnd shouldn’t be sent at all — ActivityEnd is the sole boundary marker. Removing it improved stability but didn’t resolve either failure mode.

  2. Queue drain before ActivityEnd — we now flush any buffered audio out of our inbound queue before dispatching ActivityEnd, to avoid stale audio crossing the turn boundary.

  3. Treating track.recv() timeouts as silence — we modified our VAD so that WebRTC frame delivery timeouts count toward the silence accumulator. This partially addresses Failure B by allowing speech end to fire even when the track goes dead instead of silent.

  4. _activity_ended guard on the audio send loop — a flag that blocks further audio chunks from being dispatched after ActivityEnd has been sent.
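Mitigations 3 and 4 combine into a single send loop. The sketch below shows the shape of ours — session.send_audio / session.send_activity_end are hypothetical wrappers around the corresponding send_realtime_input calls, and the timeout and hangover values are illustrative:

```python
import asyncio

FRAME_TIMEOUT_S = 0.05   # assumed: roughly one frame interval of dead air
HANGOVER_FRAMES = 10     # assumed: silent frames (real or timed-out) before turn end

class TurnSender:
    """Streams one utterance. A recv() timeout counts toward the silence
    hangover (mitigation 3), and the _activity_ended flag guarantees no
    audio is dispatched after the turn boundary (mitigation 4)."""

    def __init__(self, session):
        self.session = session
        self._activity_ended = False
        self._silence = 0

    def _mark_silence(self):
        self._silence += 1
        if self._silence >= HANGOVER_FRAMES:
            self._activity_ended = True

    async def run(self, track, is_silent):
        while not self._activity_ended:
            try:
                frame = await asyncio.wait_for(track.recv(), FRAME_TIMEOUT_S)
            except asyncio.TimeoutError:
                self._mark_silence()   # dead track counts as a silent frame
                continue
            if is_silent(frame):
                self._mark_silence()
            else:
                self._silence = 0
            if self._activity_ended:
                break                  # guard: never send audio past the boundary
            await self.session.send_audio(frame)
        await self.session.send_activity_end()
```

This guarantees ActivityEnd fires even when track.recv() goes quiet instead of delivering silent frames — yet, as described below, sessions still die.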

Despite all of these mitigations, sessions are still dying. Our outstanding questions:

  1. Is ActivityEnd guaranteed to trigger model inference in all cases, or are there edge conditions where Gemini silently ignores it? We’ve observed failures even when the queue was fully drained and no post-boundary audio was present.

  2. Can audio arriving in the 50–200ms window after ActivityEnd corrupt the turn boundary? Our audio loop and the activity signal dispatch run concurrently — is there a race condition at the SDK’s WebSocket serialisation layer?

  3. Are there any known issues with gemini-2.5-flash-native-audio-preview-12-2025 specifically in manual VAD mode?

  4. What is the correct pattern for guaranteeing that ActivityEnd is the absolute last message Gemini receives before it begins inference? The audio send path and the activity signal path in send_realtime_input appear to follow separate code routes through the SDK — is there an explicit flush or synchronisation mechanism we should be using?
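On question 4, the only pattern we can think of — and we'd like confirmation this is the intended approach, since we haven't found a documented flush mechanism in the SDK — is to make a single task the sole writer, funnelling every outbound message (audio and activity signals alike) through one asyncio.Queue so that ActivityEnd is provably the last message enqueued for the turn. A sketch, with send_audio / send_activity_end as hypothetical wrappers around the two send_realtime_input paths:

```python
import asyncio

END = object()  # sentinel marking the turn boundary

async def sender(queue: asyncio.Queue, session) -> None:
    """Sole writer to the live session: drains the queue in FIFO order, so
    audio chunks and ActivityEnd can never interleave on the wire."""
    while True:
        msg = await queue.get()
        if msg is END:
            await session.send_activity_end()  # wraps send_realtime_input(activity_end=...)
            queue.task_done()
            break
        await session.send_audio(msg)          # wraps send_realtime_input(audio=...)
        queue.task_done()
```

The audio loop and the VAD then only ever call queue.put_nowait(...); because exactly one task awaits on the session, ordering is determined by enqueue order rather than by a race between two concurrent send paths. Whether this is necessary — i.e. whether the SDK already serialises concurrent send_realtime_input calls onto the WebSocket — is precisely what we're asking.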