Gemini Live Flash 3.1 API: inputTranscription no longer streams incrementally

Hello Google AI team,

After upgrading from Gemini 2.5 to 3.1 Flash in the Live API (WebSocket), I noticed that inputTranscription is no longer delivered incrementally. It is now only delivered after the user finishes the full utterance.

For example, Gemini Live 2.5 behaved like this:

{"serverContent": {"inputTranscription": {"text": " Ho"}}}
{"serverContent": {"inputTranscription": {"text": "w"}}}
{"serverContent": {"inputTranscription": {"text": " are"}}}
{"serverContent": {"inputTranscription": {"text": " you"}}}

This change affects my app: it records audio in short bursts and relies on incremental inputTranscription updates to know whether the user is still speaking so it can extend the recording window. Without these real-time chunks, the recording often cuts off prematurely.
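For context, the pattern my app depends on is roughly this (a simplified sketch against the raw WebSocket messages; `ws`, `RECORD_EXTENSION_S`, and the timing values are illustrative, not real API names):

import json
import time

# Hypothetical value: how long to keep recording after each transcript chunk.
RECORD_EXTENSION_S = 2.0

async def record_until_silence(ws, initial_window_s=2.0):
    """Keep recording as long as inputTranscription chunks keep arriving.

    `ws` is assumed to be an async iterator of raw Live API WebSocket messages.
    """
    deadline = time.monotonic() + initial_window_s
    async for raw in ws:
        msg = json.loads(raw)
        chunk = msg.get("serverContent", {}).get("inputTranscription", {})
        if chunk.get("text"):
            # The user is still speaking: push the recording cut-off forward.
            deadline = time.monotonic() + RECORD_EXTENSION_S
        if time.monotonic() > deadline:
            break  # no transcript chunks recently, stop extending the recording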

For comparison, the OpenAI Realtime API provides input_audio_buffer.speech_started and speech_stopped events for this exact purpose. However, I couldn’t find similar indicators in Gemini 3.1 Flash Live.

How would you recommend handling this change? Is there a new way to detect ongoing speech?

Thanks!

For this, you have two options:

  • Use a client-side VAD (voice activity detection) system to track the start and end of speech before sending audio.

  • Rely on the input transcription: run a timing loop that measures the gap between when the agent speaks and when you receive the input transcription back from Gemini, and tag the data you send to Gemini so you can correlate the two.

Client-side VAD is viable and not very complex to implement.
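For example, a naive client-side check over 16-bit PCM frames can be as simple as an RMS threshold (a minimal sketch; SPEECH_RMS_THRESHOLD is an assumption you would calibrate for your mic and environment):

import array
import math

# Hypothetical threshold: RMS amplitude above this counts as speech.
SPEECH_RMS_THRESHOLD = 500

def is_speech(frame_bytes: bytes) -> bool:
    """Return True if a 16-bit little-endian PCM frame looks like speech."""
    samples = array.array("h", frame_bytes)
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > SPEECH_RMS_THRESHOLD

Running each outgoing audio frame through a check like this lets the client mark where speech starts (first speech frame) and where it ends (a run of silent frames).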

Hi icapora,

I use the Gemini Live API in an Android app. There are two ways to implement client-side VAD:

  1. Detect the audio input’s amplitude and set a threshold. However, this depends heavily on the hardware and is also influenced by background noise. Gemini Live’s built-in detection already handles this very well.
  2. Use a cloud VAD or an embedded AI model. This is more complicated and makes the app heavyweight.

I think it would be a regression to switch to client-side VAD. Maybe I could try a lightweight cloud VAD?

Thanks for your reply!

I totally agree — Gemini’s built-in server-side VAD is excellent and implementing your own from scratch would feel like a step backward. Before going down the cloud VAD route, here are a couple of lighter options worth considering:

Option 1: Stick with Gemini’s auto VAD, but fix the audio lifecycle

In many cases, the 1011 timeout isn’t actually a VAD problem — it’s an audio stream management issue. The key thing the docs emphasize is that when you pause the audio stream for more than ~1 second (e.g., while the model is speaking back), you MUST send an audioStreamEnd event to flush the server’s audio buffer. Without it, the server-side VAD hangs waiting and eventually the WebSocket times out. Once you’re ready to send audio again, just resume — no special restart needed. This alone fixes the “Turn 2 dies” problem for many people.
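With the Python google-genai SDK, that looks roughly like this (a sketch; `session` and `chunk` come from your own streaming loop):

from google.genai import types

# When the app pauses the mic (e.g., while playing back the model's reply),
# flush the server-side audio buffer so the server VAD doesn't hang:
await session.send_realtime_input(audio_stream_end=True)

# ...later, when the user can speak again, simply resume streaming chunks:
await session.send_realtime_input(
    audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
)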

Option 2: Lightweight on-device VAD (no cloud needed)

If you do need client-side VAD, there’s an excellent Android-native library specifically built for this:

:package: android-vad (GitHub: gkonovalov/android-vad) — Android Voice Activity Detection (VAD) library. Supports WebRTC VAD (GMM), Silero VAD (DNN), and Yamnet VAD (DNN) models.

It supports three models:

  • WebRTC VAD — Only 158 KB, extremely fast, GMM-based. Good enough for basic speech/silence detection. Low accuracy in noisy environments but almost zero overhead.
  • Silero VAD — ~2 MB ONNX model, runs via ONNX Runtime Mobile directly on-device. Much more accurate than WebRTC, handles background noise well. Each 30ms audio chunk takes <1ms to process.
  • Yamnet VAD — TFLite-based, can classify 521 audio event types. Heaviest of the three but most capable.

For your use case, I’d suggest Silero VAD via this library — it runs 100% on-device (no cloud latency or cost), it’s accurate enough to reliably detect speech vs. silence, and the overhead is minimal on modern Android devices.

Option 3: Use Gemini’s manual VAD mode with on-device detection

You can combine an on-device VAD with Gemini’s manual activity signaling. Disable the server-side auto VAD and send activityStart / activityEnd signals yourself:

from google.genai import types

# Disable Gemini's automatic (server-side) activity detection:
config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    }
}

# When your local VAD detects speech:
await session.send_realtime_input(activity_start=types.ActivityStart())
# Stream audio...
# When your local VAD detects silence:
await session.send_realtime_input(activity_end=types.ActivityEnd())

This gives you full control over turn-taking without relying on amplitude thresholds. The on-device Silero VAD handles the “is the user speaking?” detection, and Gemini handles everything else.
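Putting the pieces together, the streaming loop might look like this (a sketch: `local_vad.is_speech()` stands in for whichever on-device VAD you pick, e.g. Silero, and `mic_frames` is your microphone capture source):

from google.genai import types

async def stream_with_local_vad(session, local_vad, mic_frames):
    """Drive Gemini's manual activity signaling from an on-device VAD.

    `local_vad.is_speech(frame)` and `mic_frames` are placeholders for
    your own VAD wrapper and audio capture loop (16 kHz, 16-bit PCM).
    """
    speaking = False
    async for frame in mic_frames:
        if local_vad.is_speech(frame):
            if not speaking:
                # Speech just started: open the turn.
                await session.send_realtime_input(activity_start=types.ActivityStart())
                speaking = True
            await session.send_realtime_input(
                audio=types.Blob(data=frame, mime_type="audio/pcm;rate=16000")
            )
        elif speaking:
            # Speech just ended: close the turn so Gemini can respond.
            await session.send_realtime_input(activity_end=types.ActivityEnd())
            speaking = False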

I’d recommend trying Option 1 first (it’s the simplest fix), and if you need more control, go with Option 3 + Silero via android-vad.

Hope this helps! :raising_hands:
