Stack: `gemini-2.5-flash-native-audio-preview-12-2025` · Python `google-genai` SDK · WebRTC audio bridge via aiortc · Manual VAD (`automatic_activity_detection: disabled`) · `thinking_budget: 0` · 4 custom tool declarations
What we’re building
We’re developing a real-time AI interview platform that routes candidate audio through a WebRTC pipeline: browser microphone → aiortc WebRTC track → server-side resampling from 48 kHz down to 16 kHz → `send_realtime_input(audio=...)` into a live Gemini session. Because we need precise control over turn boundaries, we disabled Gemini’s built-in VAD entirely and implemented our own amplitude-based voice activity detection on the server side. When our VAD detects speech onset, we fire `send_realtime_input(activity_start=ActivityStart())`; when it detects silence long enough to call an utterance complete, we fire `send_realtime_input(activity_end=ActivityEnd())`.
The Gemini session initialises correctly — the agent delivers its opening greeting and `turn_complete` fires as expected. The problem surfaces the moment the candidate starts speaking. Without exception, the session terminates with:

```
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error)
keepalive ping timeout; no close frame received
```
Two distinct failure patterns
Failure A — `ActivityEnd` is sent but Gemini never responds. Audio delivery stops, no `turn_complete` event arrives, and the connection dies with a 1011 error roughly 10–12 seconds later. The signal reached Gemini, but inference never triggered.
Failure B (dominant) — Our VAD detects speech onset and begins streaming audio. The candidate speaks for anywhere between 30 and 60 seconds, then stops — but our VAD never accumulates enough silence to emit `ActivityEnd`. The root cause: once the candidate goes quiet, `track.recv()` on the WebRTC side stops returning frames altogether rather than delivering silent frames. With no incoming audio, our silence counter never advances, `ActivityEnd` never gets sent, and Gemini is left holding an open turn indefinitely. The 1011 follows ~10–12 seconds later.
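To make the stall concrete, here is a minimal sketch of the kind of amplitude-based VAD we run (class name, thresholds, and frame sizes are illustrative, not our production values). The key property: the silence counter only advances when `feed()` is called with a frame, so a track that stops delivering frames freezes the "speech END" decision exactly as described above.

```python
# Illustrative amplitude VAD: frame-driven silence accumulator.
import math
import struct

def rms(pcm16: bytes) -> float:
    """Root-mean-square of little-endian 16-bit mono PCM."""
    n = len(pcm16) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm16)
    return math.sqrt(sum(s * s for s in samples) / n)

class AmplitudeVAD:
    def __init__(self, rms_threshold=500.0, silence_ms_needed=800, frame_ms=20):
        self.rms_threshold = rms_threshold
        self.frames_needed = silence_ms_needed // frame_ms
        self.silent_frames = 0
        self.speaking = False

    def feed(self, pcm16: bytes):
        """Returns 'start', 'end', or None for one audio frame.

        If no frame ever arrives, this is never called, the counter
        never advances, and 'end' never fires — Failure B in a nutshell.
        """
        if rms(pcm16) >= self.rms_threshold:
            self.silent_frames = 0
            if not self.speaking:
                self.speaking = True
                return "start"
        elif self.speaking:
            self.silent_frames += 1
            if self.silent_frames >= self.frames_needed:
                self.speaking = False
                self.silent_frames = 0
                return "end"
        return None
```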
Representative log for Failure B:
```
14:05:43 [GeminiLive] turn_complete — agent greeted candidate
14:05:44 [Bridge] VAD unlocked — mic ready
14:05:45 [Bridge] VAD: speech START
14:05:45 [Connection] Candidate activity START → send_activity_start() called
... 1900 audio chunks sent over ~38 seconds ...
14:06:23 last audio chunk logged
--- 11 seconds of silence ---
14:06:34 [Agent] ERROR: sent 1011 (internal error) keepalive ping timeout
```
No `VAD: speech END` log line ever appears, and `ActivityEnd` is never dispatched.
What we’ve already tried
- Removed the `send_audio_stream_end()` call — we’d been calling it as a flush signal after `ActivityEnd`, based on a community tip. The Vertex AI docs clarify that in manual VAD mode, `AudioStreamEnd` shouldn’t be sent at all; `ActivityEnd` is the sole boundary marker. Removing it improved stability but didn’t resolve either failure mode.
- Queue drain before `ActivityEnd` — we now flush any buffered audio out of our inbound queue before dispatching `ActivityEnd`, to avoid stale audio crossing the turn boundary.
- Treating `track.recv()` timeouts as silence — we modified our VAD so that WebRTC frame-delivery timeouts count toward the silence accumulator. This partially addresses Failure B by allowing speech end to fire even when the track goes dead instead of delivering silence.
- `_activity_ended` guard on the audio send loop — a flag that blocks further audio chunks from being dispatched after `ActivityEnd` has been sent.
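Combined, the mitigations above amount to a single receive loop that owns the whole turn. Here is a sketch of ours (the `track`, `vad`, and `sender` interfaces are our own abstractions over aiortc, our VAD, and `session.send_realtime_input`; only `asyncio.wait_for` around `track.recv()` is a real library call):

```python
# Sketch: one loop owning recv → VAD → audio → ActivityEnd-last ordering.
import asyncio

FRAME_TIMEOUT_S = 0.05        # a bit over one 20 ms frame interval
SILENT_FRAME = b"\x00" * 640  # 20 ms of 16-bit mono silence at 16 kHz

async def pump_candidate_audio(track, vad, sender):
    """Pump one utterance; guarantees nothing is sent after ActivityEnd.

    track:  aiortc-style object with async recv()
    vad:    feed(pcm) -> 'start' | 'end' | None, plus a .speaking flag
    sender: thin wrapper over session.send_realtime_input(...)
    """
    while True:
        try:
            pcm = await asyncio.wait_for(track.recv(), FRAME_TIMEOUT_S)
        except asyncio.TimeoutError:
            # Failure B mitigation: a dead track still advances the
            # silence accumulator by feeding synthesized silence.
            pcm = SILENT_FRAME
        event = vad.feed(pcm)
        if event == "start":
            await sender.activity_start()
        if vad.speaking:
            await sender.audio(pcm)
        if event == "end":
            # Acts as the _activity_ended guard: we return immediately,
            # so no audio chunk can follow the boundary marker.
            await sender.activity_end()
            return
```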
Despite all of these mitigations, sessions are still dying. Our outstanding questions:
- Is `ActivityEnd` guaranteed to trigger model inference in all cases, or are there edge conditions where Gemini silently ignores it? We’ve observed failures even when the queue was fully drained and no post-boundary audio was present.
- Can audio arriving in the 50–200 ms window after `ActivityEnd` corrupt the turn boundary? Our audio loop and the activity-signal dispatch run concurrently — is there a race condition at the SDK’s WebSocket serialisation layer?
- Are there any known issues with `gemini-2.5-flash-native-audio-preview-12-2025` specifically in manual VAD mode?
- What is the correct pattern for guaranteeing that `ActivityEnd` is the absolute last message Gemini receives before it begins inference? The audio send path and the activity-signal path in `send_realtime_input` appear to follow separate code routes through the SDK — is there an explicit flush or synchronisation mechanism we should be using?
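On the last question, the pattern we’re currently experimenting with (our own idea, not a documented `google-genai` mechanism — we’d welcome confirmation that it’s necessary or sufficient) is to funnel every outbound message through a single writer task, so ordering is fixed by queue order rather than by the scheduling of concurrent senders:

```python
# Sketch: a single-writer queue so ActivityEnd cannot race past audio.
# SerializedSender and the kind/payload convention are hypothetical.
import asyncio

class SerializedSender:
    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()

    async def writer(self, send_fn):
        """The only task allowed to call send_fn; preserves enqueue order.

        In our real code send_fn would dispatch to the matching
        session.send_realtime_input(...) keyword for each kind.
        """
        while True:
            kind, payload = await self.queue.get()
            if kind == "close":
                return
            await send_fn(kind, payload)
            self.queue.task_done()

    def put(self, kind, payload=None):
        """Producers (audio loop, VAD loop) only ever enqueue."""
        self.queue.put_nowait((kind, payload))
```

Because `ActivityEnd` is just another queue item, enqueueing it after the final audio chunk guarantees it is also *sent* after the final audio chunk, regardless of which coroutine produced what.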