I totally agree — Gemini’s built-in server-side VAD is excellent and implementing your own from scratch would feel like a step backward. Before going down the cloud VAD route, here are a couple of lighter options worth considering:
Option 1: Stick with Gemini’s auto VAD, but fix the audio lifecycle
In many cases, the 1011 timeout isn’t actually a VAD problem — it’s an audio stream management issue. The key thing the docs emphasize is that when you pause the audio stream for more than ~1 second (e.g., while the model is speaking back), you MUST send an audioStreamEnd event to flush the server’s audio buffer. Without it, the server-side VAD hangs waiting and eventually the WebSocket times out. Once you’re ready to send audio again, just resume — no special restart needed. This alone fixes the “Turn 2 dies” problem for many people.
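To make that lifecycle concrete, here is a minimal sketch. The `send_realtime_input` keyword arguments (`audio=...`, `audio_stream_end=True`) follow the google-genai Python SDK's Live API; the `MicStreamer` wrapper itself is my own illustration, not part of the SDK:

```python
import asyncio

class MicStreamer:
    """Tracks mic-stream state so the server's audio buffer always gets flushed.

    `session` is assumed to expose `send_realtime_input(...)` as in the
    google-genai Python SDK's Live API client.
    """

    def __init__(self, session):
        self.session = session
        self.streaming = False

    async def send_chunk(self, pcm_bytes):
        # 16-bit PCM, 16 kHz mono is the expected input audio format.
        self.streaming = True
        await self.session.send_realtime_input(
            audio={"data": pcm_bytes, "mime_type": "audio/pcm;rate=16000"}
        )

    async def pause(self):
        # Call this whenever you stop sending audio (e.g. while the model
        # is speaking). Flushing with audio_stream_end is what prevents the
        # server-side VAD from hanging and the socket from timing out.
        if self.streaming:
            await self.session.send_realtime_input(audio_stream_end=True)
            self.streaming = False

    async def resume(self):
        # Nothing special to send on resume -- just stream chunks again.
        self.streaming = True
```

The point of the guard in `pause()` is that `audio_stream_end` should be sent exactly once per pause, not on every idle tick.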
Option 2: Lightweight on-device VAD (no cloud needed)
If you do need client-side VAD, there’s an excellent Android-native library specifically built for this:
android-vad: https://github.com/gkonovalov/android-vad (Android Voice Activity Detection library; supports WebRTC VAD, Silero VAD, and Yamnet VAD models)
It supports three models:
- WebRTC VAD — Only 158 KB, extremely fast, GMM-based. Good enough for basic speech/silence detection. Low accuracy in noisy environments but almost zero overhead.
- Silero VAD — ~2 MB ONNX model, runs via ONNX Runtime Mobile directly on-device. Much more accurate than WebRTC, handles background noise well. Each 30ms audio chunk takes <1ms to process.
- Yamnet VAD — TFLite-based, can classify 521 audio event types. Heaviest of the three but most capable.
For your use case, I’d suggest Silero VAD via this library — it runs 100% on-device (no cloud latency or cost), it’s accurate enough to reliably detect speech vs. silence, and the overhead is minimal on modern Android devices.
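Whichever model you pick, you'll want to debounce its per-frame verdicts so one noisy 30 ms frame doesn't flip you between speech and silence. A minimal sketch in Python (the frame counts are assumptions to tune, and `SpeechGate` is a hypothetical helper, not part of android-vad; on Android you'd port the same logic to Kotlin):

```python
class SpeechGate:
    """Turns noisy per-frame VAD decisions into stable speech segments.

    The caller supplies one boolean per audio frame (e.g. the Silero
    verdict for a 30 ms chunk); the thresholds below are assumptions.
    """

    def __init__(self, start_frames=3, end_frames=15):
        self.start_frames = start_frames  # ~90 ms of speech opens the gate
        self.end_frames = end_frames      # ~450 ms of silence closes it
        self.speech_run = 0
        self.silence_run = 0
        self.open = False

    def update(self, frame_is_speech):
        """Feed one frame's verdict; returns 'start', 'end', or None."""
        if frame_is_speech:
            self.speech_run += 1
            self.silence_run = 0
            if not self.open and self.speech_run >= self.start_frames:
                self.open = True
                return "start"
        else:
            self.silence_run += 1
            self.speech_run = 0
            if self.open and self.silence_run >= self.end_frames:
                self.open = False
                return "end"
        return None
```

The hangover on the silence side is what keeps natural mid-sentence pauses from ending the user's turn prematurely.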
Option 3: Use Gemini’s manual VAD mode with on-device detection
You can combine an on-device VAD with Gemini’s manual activity signaling. Disable the server-side auto VAD and send activityStart / activityEnd signals yourself:
config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    }
}

# When your local VAD detects speech:
await session.send_realtime_input(activity_start=ActivityStart())
# Stream audio...
# When your local VAD detects silence:
await session.send_realtime_input(activity_end=ActivityEnd())
This gives you full control over turn-taking without relying on amplitude thresholds. The on-device Silero VAD handles the “is the user speaking?” detection, and Gemini handles everything else.
I’d recommend trying Option 1 first (it’s the simplest fix), and if you need more control, go with Option 3 + Silero via android-vad.
Hope this helps! 