Gemini 3.1 Flash Live — Great upgrade from 2.5, two model behavior observations from production voice app

We migrated our product from Gemini 2.5 to 3.1 Flash Live and are shipping to production. The conversation quality is noticeably better than 2.5. Latency is lower, responses feel more natural, and we haven’t experienced the 1011 “Resource exhausted” disconnections that occasionally occurred on 2.5. We’re very happy with the upgrade overall.

Below are 2 model behaviors we observed during development that affect the user experience.

Our Use Case

We build a real-time voice conversation platform at joespeaking. Each session is a single continuous WebSocket connection lasting 3 to 14 minutes with 8 to 25 turns. We use native audio input/output with client-managed VAD (automaticActivityDetection disabled) and function calling for structured data extraction. Thinking level is set to minimal for lowest latency.
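For readers using the Python SDK rather than a raw browser WebSocket, a rough sketch of an equivalent session setup might look like this (an illustration only; our production client is JavaScript in the browser, and the thinking-level and tool settings are omitted because their exact field names depend on the SDK version):

import asyncio
from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

config = {
    "response_modalities": ["AUDIO"],
    # Client-managed VAD: we detect speech ourselves and bracket each
    # utterance with explicit activityStart / activityEnd signals.
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    },
}

async def main():
    # Model name as referenced later in this thread.
    async with client.aio.live.connect(
        model="gemini-3.1-flash-live-preview", config=config
    ) as session:
        ...  # stream audio and receive responses here

asyncio.run(main())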

Issue 1: Examiner Stalls and Doesn’t Respond

Observed in a session on March 28, 2026. Non-deterministic.

After the user finishes a full answer, the model sometimes goes silent and does not ask the next question. The user has to say “Hello” multiple times to re-engage the conversation. The model eventually responds with something like “I’m here. Just waiting for your complete answer. Were you finished?” even though the user had clearly finished speaking.

From our logs, the sequence is:

  1. User gives a full answer (about 8 seconds of speech)

  2. Client sends activityEnd

  3. A brief burst of background noise is picked up by our VAD and sent as audio (about 2 seconds, producing 0 transcript characters from Gemini)

  4. Client sends activityEnd again

  5. Model goes silent for over 10 seconds

  6. User says “Hello” twice with no response

  7. Model finally responds after the user says “Yes”

The model appears to get confused by the brief noise burst and enters a waiting state, even though activityEnd was sent after it. The session does recover on its own after about 10 seconds or when the user speaks again, but the silence is disorienting.

This happened in one of our two test sessions. The other session ran perfectly with no stalls.

We’d love to hear if other developers have encountered similar turn-taking stalls, or if there are recommended patterns for sending activityEnd that help the model distinguish between “user finished” and “brief noise.”
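One pattern we are experimenting with, shown as a rough sketch below (the helper name flush_segment and the 400 ms threshold are hypothetical, and we haven’t yet verified this eliminates the stall), is to buffer each VAD-detected segment on the client and drop it entirely when it is too short to be speech, so a noise burst never produces the audio-plus-second-activityEnd sequence above:

from google.genai import types

MIN_SEGMENT_MS = 400  # hypothetical threshold; shorter bursts are treated as noise

async def flush_segment(session, chunks, segment_ms):
    # Buffer a whole VAD-detected segment locally, then forward it wrapped in
    # activityStart/activityEnd only if it plausibly contains speech.
    if segment_ms < MIN_SEGMENT_MS:
        return  # discard the noise burst; no activityEnd is sent for it
    await session.send_realtime_input(activity_start=types.ActivityStart())
    for chunk in chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )
    await session.send_realtime_input(activity_end=types.ActivityEnd())

The trade-off is added latency, since audio is only sent once the segment ends rather than streamed as it is captured.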

Issue 2: Non-Deterministic Function Calling and Closing Behavior

Observed across multiple test sessions during development.

When the system prompt instructs the model to perform specific actions (call a function silently, say a scripted closing phrase exactly once), the model sometimes deviates:

Function calling: With functionCallingConfig.mode set to ANY and an explicit prompt instruction to call a function silently without speaking, the model sometimes speaks the result aloud instead of calling the function, or does both. We confirmed function calling itself works: a standalone test with the same tool declaration produced a toolCall in 2 seconds. The non-determinism appears after long conversation history, where the model has been generating audio for many turns and tends to continue with speech rather than switching to function calls.

Closing phrase: When prompted to say “Thank you. That is the end of the Speaking test.” the model sometimes says it twice in a single response turn and adds unsolicited text like “Have a good day.” We confirmed this is a single model turn (not duplicate prompts from our side) by adding instrumentation that verified only one prompt was sent.

Both behaviors are non-deterministic; they don’t happen every session. We work around them with client-side fallbacks (transcript extraction, idle-wait nudge polling), so the product functions correctly regardless, but the user experience is better when the model follows the instructions precisely. We’re curious whether other developers have found prompt patterns that improve function calling reliability after long audio conversations.
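For reference, a minimal sketch of the tool wiring described above, using the Python SDK (the function name and schema are placeholders, not our real declaration, and the exact placement of the tool config may differ by SDK version):

from google.genai import types

tools = [types.Tool(function_declarations=[
    types.FunctionDeclaration(
        name="record_assessment",  # placeholder, not our real function
        description="Store structured data extracted from the conversation.",
        parameters=types.Schema(
            type=types.Type.OBJECT,
            properties={"summary": types.Schema(type=types.Type.STRING)},
        ),
    )
])]

config = {
    "response_modalities": ["AUDIO"],
    "tools": tools,
    # ANY requests a tool call, but as described above it does not stop the
    # model from also speaking the result aloud in long audio sessions.
    "tool_config": {"function_calling_config": {"mode": "ANY"}},
}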

What We Love About 3.1

The conversation quality is the biggest improvement. Responses feel more natural and contextually appropriate compared to 2.5. First-turn response latency is lower. We haven’t seen any unexplained disconnections. The thinkingLevel minimal setting provides a good balance of speed and quality. Function calling, when it fires correctly, produces well-structured responses matching our schema.

We’re confident enough to ship this to production.

Technical Details

Our client runs in the browser with a direct WebSocket connection to Gemini Live API. Audio is PCM 16-bit mono at 16kHz input and 24kHz output. Sessions last 3 to 14 minutes with 8 to 25 turns. We use 1 function declaration with a complex schema. Context window usage reaches 3,000 to 7,000 tokens by session end.
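For reference, the same audio formats expressed in the Python SDK (a sketch; our own client sends the equivalent frames over the raw WebSocket, and this fragment assumes an open Live session):

# Input: 16-bit mono PCM at 16 kHz
await session.send_realtime_input(
    audio=types.Blob(data=mic_chunk, mime_type="audio/pcm;rate=16000")
)
# Output: the model returns 16-bit mono PCM at 24 kHz in message.data,
# which we queue straight to the speaker.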

Next Steps

We’re monitoring real user feedback in production and will report back with any patterns we observe at scale.

Before this migration, we spent 3 weeks building a WebRTC architecture (via LiveKit) to solve audio reliability and echo cancellation issues we experienced on Gemini 2.5. We’re glad that most of those issues have been resolved by the 3.1 upgrade itself. The remaining model behaviors above (turn-taking stalls, non-deterministic function calling) may still benefit from WebRTC’s more robust audio pipeline, so we plan to evaluate that path for additional stability.

Contact: Joe Hu
Product: joespeaking


Nearly the same feedback on our side.


Hi Joe, thanks for sharing these great insights! (Please note: I am not very fluent in English, so I used AI to help me write this!)

I’m currently building a Python-based voice app using the google-genai SDK and pyaudio with gemini-3.1-flash-live-preview, and I’ve been wrestling with a frustrating issue for the past 3 hours. I’m hoping you might have some advice.

I am hitting the exact 1011 keepalive ping timeout wall you mentioned. Here is my exact situation:

  • Turn 1 works perfectly: The model hears me, processes the speech, and responds beautifully.

  • Turn 2 dies: After the model finishes speaking its first response, it completely stops responding to my next input. After about 70 seconds of dead air, the session crashes with sent 1011 (internal error) keepalive ping timeout; no close frame received.

What I’ve tried (using earphones to prevent echo):

  1. Sending a continuous raw audio stream without any interruptions.

  2. Implementing manual VAD (sending silence/0s when the user is not speaking, or muting the mic while the model speaks).

  3. Using asyncio.Queue and thread-safe callbacks to ensure the Python event loop isn’t blocked by audio I/O.

None of these solved the “Turn 2 timeout” issue. Did you encounter this specific behavior where the first turn works but subsequent turns hang and time out?

How exactly are you managing the microphone stream while the model is playing back audio to keep the websocket alive and the VAD happy? Any advice or code snippets would be an absolute lifesaver!


Welcome to the community, Namu! :waving_hand: Great first post — and your English is perfectly clear, no worries there.

The “Turn 1 works, Turn 2 dies” pattern you’re describing is one of the most common pitfalls with the Gemini Live API, and it’s almost always related to how the audio stream is managed between turns. Here are some things to check:

1. Send audioStreamEnd when you pause the mic

This is the #1 thing people miss. The official docs are very clear about this: when the audio stream is paused for more than a second (e.g., while the model is speaking back), you MUST send an audioStreamEnd event to flush any cached audio on the server side. Without it, the server’s VAD gets stuck waiting for the end of your “utterance” and eventually times out.

With the google-genai SDK:

await session.send_realtime_input(audio_stream_end=True)

Then, when you’re ready to send audio again for Turn 2, just resume sending audio blobs normally — no special “restart” signal is needed.

2. Make sure your receive() loop runs concurrently with your send() loop

A common mistake is to stop consuming server messages while waiting for the user to speak again. The receive loop must be running at ALL times via asyncio tasks — if you block it or let it fall behind, the WebSocket keepalive pings won’t be answered and you’ll get the 1011 timeout.

Basic pattern:

async with client.aio.live.connect(model=model, config=config) as session:
    await asyncio.gather(
        send_audio_task(session),
        receive_response_task(session)
    )
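If it helps, here is a slightly fuller sketch of those two tasks (queue names and chunk handling are illustrative; send_realtime_input(audio=...) is the form used by recent google-genai versions):

import asyncio
from google.genai import types

mic_queue = asyncio.Queue()       # filled by your PyAudio input callback
playback_queue = asyncio.Queue()  # drained by your PyAudio output stream

async def send_audio_task(session):
    while True:
        chunk = await mic_queue.get()  # raw 16-bit mono PCM at 16 kHz
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )

async def receive_response_task(session):
    # Must keep running at all times so the keepalive pings are answered.
    while True:
        async for message in session.receive():
            if message.data:  # 24 kHz PCM audio from the model
                playback_queue.put_nowait(message.data)
        # session.receive() completes at each turn boundary; loop to keep listening.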

3. Don’t send silence/zeros — send audioStreamEnd instead

You mentioned sending silence (0s) when the user isn’t speaking. This can confuse the server-side VAD. The correct pattern is: stream real audio → send audioStreamEnd when the user stops → resume streaming when the user starts talking again.
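In code, that lifecycle is simply the following (a sketch; the trigger comes from whatever local VAD or push-to-talk logic you use):

async def on_user_stopped_speaking(session):
    # Instead of streaming zeros, tell the server the stream is paused so its
    # VAD can close the utterance and hand the turn to the model.
    await session.send_realtime_input(audio_stream_end=True)
    # When the user starts talking again, simply resume sending audio blobs.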

4. Consider disabling automatic VAD for more control

If you want full control over turn-taking, you can disable the built-in VAD and manually signal activity:

from google.genai import types

config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True}
    },
}
# Then manually bracket each user utterance:
await session.send_realtime_input(activity_start=types.ActivityStart())
# ... stream audio ...
await session.send_realtime_input(activity_end=types.ActivityEnd())

5. Check out the official example repo

Google has a working reference implementation that handles all of this correctly: a minimal command-line app that streams mic audio and plays back responses using the Gen AI SDK. I’d recommend using it as your baseline and comparing your audio lifecycle management against theirs.

Also worth noting: the 1011 error can sometimes be a transient server-side issue on Google’s end (there are several open issues on GitHub about this). But the Turn 1 OK → Turn 2 dead pattern specifically points to the audioStreamEnd / receive loop issue.

Hope this helps — let us know how it goes! :rocket:


Great advice from @icapora above. We do exactly that (manual VAD with automaticActivityDetection: { disabled: true }). Here are a few additional things we learned shipping this to production that might help with your Turn 2 issue:

Don’t send audio between activityEnd and the next activityStart.

This was our biggest lesson. After sending activityEnd, we suppress all outbound audio until the next activityStart. We observed 1007 (precondition failed) disconnects when audio leaked through in that gap. In our implementation, the mic hardware stays running but we short-circuit the send callback with a flag.
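In Python terms, the equivalent of our send-callback flag would be something like this (a sketch; our real client runs in the browser):

from google.genai import types

suppress_outbound = False  # True while we are between activityEnd and the next activityStart

async def end_turn(session):
    global suppress_outbound
    suppress_outbound = True
    await session.send_realtime_input(activity_end=types.ActivityEnd())

async def begin_turn(session):
    global suppress_outbound
    await session.send_realtime_input(activity_start=types.ActivityStart())
    suppress_outbound = False

async def forward_mic_chunk(session, chunk):
    # The mic keeps running, but nothing leaves the client during the gap.
    if suppress_outbound:
        return
    await session.send_realtime_input(
        audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
    )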

Echo cancellation matters more than you’d expect.

The default Live API behavior is START_OF_ACTIVITY_INTERRUPTS — if the model’s own audio output leaks back through the mic, it can trigger a barge-in and confuse the session state. We enable echoCancellation: true, noiseSuppression: true, and autoGainControl: true. With earphones this is less of an issue, but worth checking your PyAudio setup isn’t feeding playback audio back into the input stream.

Session resumption for handling 1011s.

We still see occasional 1011 disconnects on 3.1 — less frequent than 2.5 but not zero. Our strategy is auto-reconnect using session resumption tokens (sessionResumptionUpdate messages that Gemini sends during the session). Exponential backoff: 1s, 2s, 4s. For your Python client, store the newHandle from these messages and reconnect with it if the session drops.

See: https://ai.google.dev/gemini-api/docs/live-api/session-management
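For a Python client, a sketch of that reconnect loop (field names follow the session-management docs linked above; handle_message is a hypothetical callback and the backoff values are simply the ones we use):

import asyncio
from google import genai

client = genai.Client()
resume_handle = None  # newHandle from the most recent sessionResumptionUpdate

async def run_with_reconnect(model, base_config, handle_message):
    global resume_handle
    backoff = 1
    while True:
        config = {**base_config, "session_resumption": {"handle": resume_handle}}
        try:
            async with client.aio.live.connect(model=model, config=config) as session:
                backoff = 1  # reset once we are connected again
                while True:
                    async for message in session.receive():
                        update = message.session_resumption_update
                        if update and update.resumable and update.new_handle:
                            resume_handle = update.new_handle  # keep for the next reconnect
                        await handle_message(message)
        except Exception:
            if backoff > 4:
                raise  # three reconnect attempts (1s, 2s, 4s), then give up
            await asyncio.sleep(backoff)
            backoff *= 2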

One 3.1-specific gotcha: don’t use periodic flush.

If you’re thinking of sending periodic activityEnd/activityStart to keep the session alive — don’t on 3.1. It interprets activityEnd as “user is done speaking” and responds prematurely, cutting off longer speech. This worked fine on 2.5 but breaks conversations on 3.1.

For your specific Turn 2 issue: I’d check whether your receive loop is truly running concurrently (as icapora mentioned) and whether any audio is being sent in the gap between turns. Those two things together cause the exact pattern you’re describing.


Thanks! Could you take a look at the reply below?

My team is running into the intermittent stalling issue too. We’re using gemini-live-2.5-flash-native-audio on Vertex AI with auto VAD enabled, and the model will randomly stall after user input. It seems to occur after input via both client content and real-time input. A few things we’ve noticed when this occurs:

  • It takes between 1-3 additional turns of user input before the model responds again
  • Gemini still provides input transcription
  • Response token count sits around 1-6 tokens and the response only contains text (per candidatesTokensDetails in the usage metadata)

We only recently (about two weeks ago) started encountering this issue and are at a dead end on troubleshooting. Hope this information is helpful for other developers facing the same issue. We’d love to hear if anyone has found a robust solution, or even a temporary workaround.


3.1 Flash Lite arriving before 3.1 Flash… it makes me contemplate the nature of the models’ training.

Makes me think one is nonsensically over the other.

I really don’t understand, unless there’s a pattern I can infer in the grander context mathematically.

Shrug. Will we ever stop this game?

We can’t say for sure it’s the exact same underlying issue, but 2.5 definitely had enough reliability problems for us that we spent a lot of time trying to stabilize it.

On our side, the main documented 2.5 problems were mid-session 1011 / 1008 disconnects, transcription reliability issues during longer speech, and general turn-management / instruction-following instability. We tried multiple times to make 2.5 robust enough for production, and before migrating we even spent about 3 weeks exploring a WebRTC architecture via LiveKit because we suspected part of the problem might be in the audio pipeline or transport layer.

What ultimately changed the outcome for us was moving off 2.5. We migrated to 3.1 Flash Live and also tightened our client-side turn handling, and that was the first setup that felt materially more stable and production-ready for us.

If migration is possible in your environment, I’d strongly recommend testing 3.1 as well.

It is exactly the same thing that I’m experiencing.
Joe Hu, do you want to discuss it with me?
I can share my personal feedback as well.