Model: gemini-2.5-flash-native-audio-preview-12-2025
SDK: Python google-genai, WebRTC audio bridge via aiortc
VAD mode: Manual - automatic_activity_detection: { disabled: True }
Thinking: thinking_budget: 0
Tools: Yes - 4 custom function declarations
Setup
We’re building an AI interview platform over WebRTC. The audio pipeline is:
- Browser mic → aiortc WebRTC track → resampled 48kHz→16kHz → send_realtime_input(audio=...)
- We disabled auto-VAD and implemented our own amplitude-based VAD on the server
- On speech start: send_realtime_input(activity_start=ActivityStart())
- On speech end: send_realtime_input(activity_end=ActivityEnd())
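For readers unfamiliar with amplitude-based VAD, here is a minimal sketch of the kind of start/end detector described above. It is illustrative only and not the bridge's actual code: the class name AmplitudeVAD, the RMS threshold, and the frame counts are all assumptions, and it expects 16-bit little-endian mono PCM frames.

```python
import math
import struct


class AmplitudeVAD:
    """Tiny RMS-threshold VAD over 16-bit mono PCM frames (hypothetical helper)."""

    def __init__(self, rms_threshold=500.0, start_frames=3, end_frames=25):
        self.rms_threshold = rms_threshold  # assumed value, tune per microphone
        self.start_frames = start_frames    # consecutive loud frames to open a turn
        self.end_frames = end_frames        # consecutive quiet frames to close it
        self.speaking = False
        self._loud = 0
        self._quiet = 0

    @staticmethod
    def rms(pcm: bytes) -> float:
        """Root-mean-square amplitude of a little-endian int16 buffer."""
        n = len(pcm) // 2
        if n == 0:
            return 0.0
        samples = struct.unpack(f"<{n}h", pcm[: n * 2])
        return math.sqrt(sum(s * s for s in samples) / n)

    def process(self, pcm: bytes):
        """Return "start", "end", or None for one frame."""
        if self.rms(pcm) >= self.rms_threshold:
            self._loud += 1
            self._quiet = 0
        else:
            self._quiet += 1
            self._loud = 0
        if not self.speaking and self._loud >= self.start_frames:
            self.speaking = True
            return "start"
        if self.speaking and self._quiet >= self.end_frames:
            self.speaking = False
            return "end"
        return None
```

In this sketch a "start" event would map to the ActivityStart send and an "end" event to the ActivityEnd send. Note that process() only runs when a frame arrives, which is exactly why a track that stops delivering frames can leave the detector stuck in the speaking state.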
The Gemini session opens, the agent greets the candidate (turn_complete fires correctly), and then the candidate speaks. After this the session consistently dies with:
websockets.exceptions.ConnectionClosedError: sent 1011 (internal error) keepalive ping timeout; no close frame received
Two failure modes we’ve observed
Mode A: VAD fires speech END, ActivityEnd is sent, but Gemini never responds. Audio chunks stop, no turn_complete ever arrives, 1011 after ~10–12 seconds.
Mode B (more common): VAD speech START fires and audio streams for 30–60 seconds, but VAD speech END never fires. When the candidate stops talking, track.recv() times out (no frames arrive) instead of delivering silence, so our VAD silence counter never accumulates enough to trigger ActivityEnd. Gemini sits waiting, and the 1011 follows ~10–12 seconds after the last audio chunk.
Log excerpt (Mode B - no speech END)
14:05:43 [GeminiLive] turn_complete — agent greeted candidate
14:05:44 [Bridge] VAD unlocked — mic ready
14:05:45 [Bridge] VAD: speech START
14:05:45 [Connection] Candidate activity START → send_activity_start() called
... 1900 chunks sent over ~38 seconds ...
14:06:23 last audio chunk logged
(11 seconds of nothing)
14:06:34 [Agent] ERROR: sent 1011 (internal error) keepalive ping timeout
No VAD: speech END log. No ActivityEnd ever sent. Gemini waits forever.
What we’ve tried / discovered
- Removed send_audio_stream_end() after ActivityEnd - we had been calling it as a “flush” after ActivityEnd, based on a community suggestion. Per the Vertex AI reference docs, “An AudioStreamEnd isn’t sent in this configuration. Instead, any interruption of the stream is marked by an ActivityEnd message.” Removing it didn’t fully resolve the issue.
- Draining the inbound queue before ActivityEnd - implemented a queue drain before sending ActivityEnd to prevent post-boundary audio from reaching Gemini.
- Track-timeout silence counting - modified our VAD to count WebRTC track timeouts as silence accumulation, so speech END fires even when track.recv() stops delivering frames. This partially helps Mode B.
- _activity_ended flag on send_audio - blocks the audio send loop from sending chunks after ActivityEnd.
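The track-timeout idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the bridge's real code: wait_for_speech_end, is_speech, and all the millisecond values are hypothetical, and recv is any zero-argument coroutine function (for example a track's recv method).

```python
import asyncio


async def wait_for_speech_end(recv, is_speech, frame_ms=20,
                              recv_timeout_s=0.1, end_silence_ms=600):
    """Return once end_silence_ms of silence has accumulated.

    A quiet frame adds frame_ms of silence; a recv() timeout adds the whole
    timeout, so a track that stops delivering frames still reaches the
    threshold instead of stalling the silence counter.
    """
    silence_ms = 0.0
    while silence_ms < end_silence_ms:
        try:
            frame = await asyncio.wait_for(recv(), timeout=recv_timeout_s)
        except asyncio.TimeoutError:
            # No frame arrived: treat the gap as silence.
            silence_ms += recv_timeout_s * 1000
            continue
        if is_speech(frame):
            silence_ms = 0.0  # speech resets the counter
        else:
            silence_ms += frame_ms
    return silence_ms
```

When the function returns, the caller would send ActivityEnd. One design caveat: a timeout caused by network jitter is indistinguishable from real silence here, so the timeout and threshold need to be generous enough not to cut a speaker off mid-sentence.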
I would appreciate any guidance on the following:
- Is ActivityEnd guaranteed to trigger model inference, or are there conditions under which Gemini ignores it? We’re seeing cases where it’s sent cleanly (queue drained first, no post-boundary audio) and Gemini still doesn’t respond.
- Does audio arriving in the ~50–200ms after ActivityEnd corrupt the turn boundary? The genai SDK serialises messages but our audio loop runs concurrently. Is there a race at the SDK’s websocket layer?
- Is there a known issue with gemini-2.5-flash-native-audio-preview-12-2025 and manual VAD mode?
- What is the recommended pattern for guaranteeing ActivityEnd is the last message Gemini receives before inference? The SDK’s send_realtime_input for audio and for activity signals appear to go through different code paths - is there a flush/sync mechanism?
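To make the ordering in the last question concrete, here is a minimal sketch of a single-writer pattern in which audio and activity signals share one queue, so ActivityEnd cannot be overtaken by a later audio chunk on the application side. It says nothing about what the SDK does internally. OrderedSender and its send callable are hypothetical; in a real bridge, send would wrap send_realtime_input, and the True placeholders would be the corresponding ActivityStart/ActivityEnd objects.

```python
import asyncio


class OrderedSender:
    """Funnels audio and activity signals through one queue and one writer task."""

    def __init__(self, send):
        self._send = send              # hypothetical async callable taking keyword args
        self._queue = asyncio.Queue()
        self._turn_open = False

    def activity_start(self):
        self._turn_open = True
        self._queue.put_nowait({"activity_start": True})

    def audio(self, chunk: bytes):
        # Drop audio outside an open turn so nothing is queued after activity_end.
        if self._turn_open:
            self._queue.put_nowait({"audio": chunk})

    def activity_end(self):
        if self._turn_open:
            self._turn_open = False
            self._queue.put_nowait({"activity_end": True})

    async def run(self):
        # Single consumer: messages are handed to send() in enqueue order.
        while True:
            msg = await self._queue.get()
            try:
                await self._send(**msg)
            finally:
                self._queue.task_done()

    async def flush(self):
        # Resolves once every queued message has been passed to send().
        await self._queue.join()
```

With this shape, the VAD and the audio loop only enqueue, and a single task started with asyncio.create_task(sender.run()) does all the awaiting on the socket. Whether the server also needs a gap or acknowledgement after ActivityEnd is the open question above.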