I totally agree — Gemini’s built-in server-side VAD is excellent and implementing your own from scratch would feel like a step backward. Before going down the cloud VAD route, here are a couple of lighter options worth considering:
Option 1: Stick with Gemini’s auto VAD, but fix the audio lifecycle
In many cases, the 1011 timeout isn’t actually a VAD problem — it’s an audio stream management issue. The key thing the docs emphasize is that when you pause the audio stream for more than ~1 second (e.g., while the model is speaking back), you MUST send an audioStreamEnd event to flush the server’s audio buffer. Without it, the server-side VAD hangs waiting and eventually the WebSocket times out. Once you’re ready to send audio again, just resume — no special restart needed. This alone fixes the “Turn 2 dies” problem for many people.
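To make that lifecycle concrete, here is a minimal sketch. The `send_realtime_input` keyword arguments (`audio=...`, `audio_stream_end=True`) follow the google-genai Python SDK's Live API; the `MicStreamer` wrapper itself is my own illustration, not part of the SDK:

```python
import asyncio

class MicStreamer:
    """Tracks mic-stream state so the server's audio buffer always gets flushed.

    `session` is assumed to expose `send_realtime_input(...)` as in the
    google-genai Python SDK's Live API client.
    """

    def __init__(self, session):
        self.session = session
        self.streaming = False

    async def send_chunk(self, pcm_bytes):
        # 16-bit PCM, 16 kHz mono is the expected input audio format.
        self.streaming = True
        await self.session.send_realtime_input(
            audio={"data": pcm_bytes, "mime_type": "audio/pcm;rate=16000"}
        )

    async def pause(self):
        # Call this whenever you stop sending audio (e.g. while the model
        # is speaking). Flushing with audio_stream_end is what prevents the
        # server-side VAD from hanging and the socket from timing out.
        if self.streaming:
            await self.session.send_realtime_input(audio_stream_end=True)
            self.streaming = False

    async def resume(self):
        # Nothing special to send on resume -- just stream chunks again.
        self.streaming = True
```

The point of the guard in `pause()` is that `audio_stream_end` should be sent exactly once per pause, not on every idle tick.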
Option 2: Lightweight on-device VAD (no cloud needed)
If you do need client-side VAD, there’s an excellent Android-native library specifically built for this:
android-vad: https://github.com/gkonovalov/android-vad (Android Voice Activity Detection library; supports WebRTC VAD, Silero VAD, and Yamnet VAD models)
It supports three models:
- WebRTC VAD — Only 158 KB, extremely fast, GMM-based. Good enough for basic speech/silence detection. Low accuracy in noisy environments but almost zero overhead.
- Silero VAD — ~2 MB ONNX model, runs via ONNX Runtime Mobile directly on-device. Much more accurate than WebRTC, handles background noise well. Each 30ms audio chunk takes <1ms to process.
- Yamnet VAD — TFLite-based, can classify 521 audio event types. Heaviest of the three but most capable.
For your use case, I’d suggest Silero VAD via this library — it runs 100% on-device (no cloud latency or cost), it’s accurate enough to reliably detect speech vs. silence, and the overhead is minimal on modern Android devices.
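Whichever model you pick, you'll want to debounce its per-frame verdicts so one noisy 30 ms frame doesn't flip you between speech and silence. A minimal sketch in Python (the frame counts are assumptions to tune, and `SpeechGate` is a hypothetical helper, not part of android-vad; on Android you'd port the same logic to Kotlin):

```python
class SpeechGate:
    """Turns noisy per-frame VAD decisions into stable speech segments.

    The caller supplies one boolean per audio frame (e.g. the Silero
    verdict for a 30 ms chunk); the thresholds below are assumptions.
    """

    def __init__(self, start_frames=3, end_frames=15):
        self.start_frames = start_frames  # ~90 ms of speech opens the gate
        self.end_frames = end_frames      # ~450 ms of silence closes it
        self.speech_run = 0
        self.silence_run = 0
        self.open = False

    def update(self, frame_is_speech):
        """Feed one frame's verdict; returns 'start', 'end', or None."""
        if frame_is_speech:
            self.speech_run += 1
            self.silence_run = 0
            if not self.open and self.speech_run >= self.start_frames:
                self.open = True
                return "start"
        else:
            self.silence_run += 1
            self.speech_run = 0
            if self.open and self.silence_run >= self.end_frames:
                self.open = False
                return "end"
        return None
```

The hangover on the silence side is what keeps natural mid-sentence pauses from ending the user's turn prematurely.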
Option 3: Use Gemini’s manual VAD mode with on-device detection
You can combine an on-device VAD with Gemini’s manual activity signaling. Disable the server-side auto VAD and send activityStart / activityEnd signals yourself:
config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    }
}

# When your local VAD detects speech:
await session.send_realtime_input(activity_start=ActivityStart())
# Stream audio...
# When your local VAD detects silence:
await session.send_realtime_input(activity_end=ActivityEnd())
This gives you full control over turn-taking without relying on amplitude thresholds. The on-device Silero VAD handles the “is the user speaking?” detection, and Gemini handles everything else.
I’d recommend trying Option 1 first (it’s the simplest fix), and if you need more control, go with Option 3 + Silero via android-vad.
Hope this helps! 