We migrated our product from Gemini 2.5 to 3.1 Flash Live and are shipping to production. The conversation quality is noticeably better than 2.5. Latency is lower, responses feel more natural, and we haven’t experienced the 1011 “Resource exhausted” disconnections that occasionally occurred on 2.5. We’re very happy with the upgrade overall.
Below are two model behaviors we observed during development that affect the user experience.
Our Use Case
We build a real-time voice conversation platform at joespeaking. Each session is a single continuous WebSocket connection lasting 3 to 14 minutes with 8 to 25 turns. We use native audio input/output with client-managed VAD (automaticActivityDetection disabled) and function calling for structured data extraction. Thinking level is set to minimal for lowest latency.
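For context, here is a sketch of the session setup described above. The field names follow our reading of the public Live API setup message; the model name is a placeholder and the exact shape may differ by API version, so treat this as illustrative rather than authoritative.

```typescript
// Sketch of our Live API setup frame: native audio in/out, client-managed
// VAD (server-side activity detection disabled), minimal thinking level.
// Field names are assumptions based on the public Live API docs.
const setupMessage = {
  setup: {
    model: "models/gemini-flash-live", // placeholder model name
    generationConfig: {
      responseModalities: ["AUDIO"],
      thinkingConfig: { thinkingLevel: "minimal" }, // lowest latency
    },
    realtimeInputConfig: {
      // We run our own VAD in the client, so automatic detection is off
      // and we send explicit activityStart / activityEnd markers instead.
      automaticActivityDetection: { disabled: true },
    },
  },
};

function buildSetupFrame(): string {
  return JSON.stringify(setupMessage);
}
```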
Issue 1: Examiner Stalls and Doesn’t Respond
Observed in a session on March 28, 2026. Non-deterministic.
After the user finishes a full answer, the model sometimes goes silent and does not ask the next question. The user has to say “Hello” multiple times to re-engage the conversation. The model eventually responds with something like “I’m here. Just waiting for your complete answer. Were you finished?” even though the user had clearly finished speaking.
From our logs, the sequence is:
- User gives a full answer (about 8 seconds of speech)
- Client sends activityEnd
- A brief burst of background noise is picked up by our VAD and sent as audio (about 2 seconds, producing 0 transcript characters from Gemini)
- Client sends activityEnd again
- Model goes silent for over 10 seconds
- User says "Hello" twice with no response
- Model finally responds after the user says "Yes"
The model appears to get confused by the brief noise burst and enters a waiting state, even though activityEnd was sent after it. The session does recover on its own after about 10 seconds or when the user speaks again, but the silence is disorienting.
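One mitigation we are experimenting with is gating segments client-side before they are sent at all, so short or quiet bursts never open a new activity window. This is our own heuristic, not an official pattern; the thresholds are assumptions to tune against your mic levels:

```typescript
// Client-side noise gate ahead of the WebSocket send (our own heuristic):
// only forward a captured segment as a new activity window when it is both
// long enough and loud enough to plausibly be speech.
const MIN_SPEECH_MS = 300; // shorter bursts are treated as noise
const MIN_RMS = 0.02;      // tune against your microphone levels

function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

function looksLikeSpeech(samples: Float32Array, sampleRateHz: number): boolean {
  const durationMs = (samples.length / sampleRateHz) * 1000;
  return durationMs >= MIN_SPEECH_MS && rms(samples) >= MIN_RMS;
}
```

A duration gate alone would not have caught the 2-second burst above, which is why we combine it with an energy check.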
This happened in one of our two test sessions. The other session ran perfectly with no stalls.
We’d love to hear if other developers have encountered similar turn-taking stalls, or if there are recommended patterns for sending activityEnd that help the model distinguish between “user finished” and “brief noise.”
Issue 2: Non-Deterministic Function Calling and Closing Behavior
Observed across multiple test sessions during development.
When the system prompt instructs the model to perform specific actions (call a function silently, say a scripted closing phrase exactly once), the model sometimes deviates:
Function calling: With functionCallingConfig.mode set to ANY and explicit prompt instruction to call a function silently without speaking, the model sometimes speaks the result aloud instead of calling the function, or does both. We confirmed function calling itself works. A standalone test with the same tool declaration produced a toolCall in 2 seconds. The non-determinism appears after long conversation history where the model has been generating audio for many turns and tends to continue with speech rather than switching to function calls.
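When the model speaks the result instead of emitting a toolCall, our fallback is to recover the structured value from the output transcript at end of turn. A minimal sketch, where the score scale and the regex pattern are purely illustrative of our approach, not the actual schema:

```typescript
// Fallback sketch (our own workaround, not an API feature): if no toolCall
// arrived by end of turn, try to recover the value the model spoke aloud
// from the output transcript. The "out of 9" pattern is illustrative only.
function extractScoreFromTranscript(transcript: string): number | null {
  const m = transcript.match(/\b(\d(?:\.\d)?)\s*(?:out of|\/)\s*9\b/);
  return m ? parseFloat(m[1]) : null;
}
```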
Closing phrase: When prompted to say “Thank you. That is the end of the Speaking test.” the model sometimes says it twice in a single response turn and adds unsolicited text like “Have a good day.” We confirmed this is a single model turn (not duplicate prompts from our side) by adding instrumentation that verified only one prompt was sent.
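For the duplicated closing phrase, our client-side guard truncates the displayed transcript at the first occurrence of the scripted phrase, so a repeat or trailing extras like "Have a good day." never reach the user-facing UI. A sketch of that workaround:

```typescript
// Client-side fallback (our own workaround): once the scripted closing
// phrase appears in the output transcript, treat the turn as finished and
// drop anything after the first occurrence.
const CLOSING_PHRASE = "Thank you. That is the end of the Speaking test.";

function truncateAtClosing(transcript: string): string {
  const idx = transcript.indexOf(CLOSING_PHRASE);
  if (idx === -1) return transcript;
  // Keep everything up to and including the first occurrence only.
  return transcript.slice(0, idx + CLOSING_PHRASE.length);
}
```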
Both behaviors are non-deterministic and don't happen every session. We handle them with client-side fallbacks (transcript extraction for missed function calls, idle-wait nudge polling for stalls), so the product functions correctly regardless, but the user experience is better when the model follows the instructions precisely. Curious if other developers have found prompt patterns that improve function calling reliability after long audio conversations.
What We Love About 3.1
The conversation quality is the biggest improvement. Responses feel more natural and contextually appropriate compared to 2.5. First-turn response latency is lower. We haven't seen any unexplained disconnections. The minimal thinkingLevel setting provides a good balance of speed and quality. Function calling, when it fires correctly, produces well-structured responses matching our schema.
We’re confident enough to ship this to production.
Technical Details
Our client runs in the browser with a direct WebSocket connection to the Gemini Live API. Audio is PCM 16-bit mono at 16kHz input and 24kHz output. Sessions last 3 to 14 minutes with 8 to 25 turns. We use one function declaration with a complex schema. Context window usage reaches 3,000 to 7,000 tokens by session end.
Next Steps
We’re monitoring real user feedback in production and will report back with any patterns we observe at scale.
Before this migration, we spent 3 weeks building a WebRTC architecture (via LiveKit) to solve audio reliability and echo cancellation issues we experienced on Gemini 2.5. We’re glad that most of those issues have been resolved by the 3.1 upgrade itself. The remaining model behaviors above (turn-taking stalls, non-deterministic function calling) may still benefit from WebRTC’s more robust audio pipeline, so we plan to evaluate that path for additional stability.
Contact: Joe Hu
Product: joespeaking