We’re using gemini-3.1-flash-live-preview via raw WebSocket (generativelanguage.googleapis.com) for a marine diagnostic voice assistant. A single real-world session on Android Chrome over Verizon — with the network flapping between 5G and 4G, as happens on a boat offshore — triggered what looks like a cluster of related failures. I’d like to understand what’s actually happening and whether there’s a supported mitigation path.
Environment
- Model: gemini-3.1-flash-live-preview
- Transport: raw WebSocket (not the SDK wrapper)
- Client: React web app running in an Android Chrome WebView (Capacitor)
- Network: Verizon cellular, observed handoff between 5G and 4G during the session
- Session length at failure: ~30 minutes
- sessionResumption: not enabled
- contextWindowCompression: not enabled
- safety_settings: not explicitly set (relying on defaults)
- speech_config.language_code: not set (per docs, native audio models auto-detect)
- System instruction: ~325 lines
What happened, in order
- The session occasionally failed to initialize on the first connection attempt; a page refresh resolved it. (I suspect this is an AudioContext user-gesture issue on our side; noting it for completeness.)
- Around the 10–15 minute mark, mid-session, the model’s TTS output switched from English to Portuguese (pt-BR). The user had not spoken Portuguese. Audio input during that period was degraded by the cellular handoff; I assume silent frames or packet loss caused the native audio model’s language detector to reclassify the input.
- A few turns later, the user said an engine model number (“6LPASTP”) aloud. The model responded verbatim with “I’m just a language model and can’t help with that.” This is Gemini’s own canned refusal, not anything in our codebase: we confirmed via grep that the string does not exist anywhere in our backend or frontend.
- That refusal then persisted for 7 consecutive turns, regardless of the user’s input, including simple greetings and attempts to re-engage in Portuguese. The session was effectively locked.
- Our LangSmith traces show the refusal was written into our LangGraph conversation state via graph.aupdate_state(...) from the Gemini outputTranscription stream. Once it was in state, subsequent turns saw “a recent assistant refusal” in their recent history, which compounded the stuck pattern.
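One way to spot the locked state described above without knowing the refusal text in advance is to count consecutive assistant turns with the same transcript. The following is a minimal sketch, assuming one complete transcript string is available per assistant turn; `RepeatDetector`, `normalize`, and the threshold of 3 are hypothetical names and values, not part of any Gemini or LangGraph API.

```python
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not hide repeats."""
    return " ".join(text.lower().split())


class RepeatDetector:
    """Counts consecutive assistant turns that carry the same transcript."""

    def __init__(self, threshold: int = 3) -> None:
        # Hypothetical default; the incident above ran to 7 identical turns.
        self.threshold = threshold
        self._last: Optional[str] = None
        self._streak = 0

    def observe(self, transcript: str) -> bool:
        """Record one completed assistant turn.

        Returns True once the same non-empty reply has been seen
        `threshold` times in a row.
        """
        current = normalize(transcript)
        if current and current == self._last:
            self._streak += 1
        else:
            # A new (or empty) reply starts a fresh streak.
            self._last = current
            self._streak = 1
        return self._streak >= self.threshold
```

A detector like this only flags the loop after it has started, so it is a backstop for a pattern-based check rather than a replacement for one.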
What I think is happening (please correct me)
- The language switch is documented behavior: native audio models auto-detect, and we’re not constraining language in the system instruction. Fix: add an explicit English constraint in system_instruction. I believe this is the supported path since speech_config.language_code isn’t honored on native audio. Can you confirm?
- The canned refusal appears to come from Google’s non-adjustable core safety layer (not the four adjustable harm categories, which default to OFF per the Gemini 2.5/3.x docs). If that’s correct, safety_settings = BLOCK_NONE wouldn’t help here. Is that right?
- The self-reinforcement isn’t the API’s fault; it comes from our architecture writing Gemini’s output into our own state and re-injecting it as history. The community-recommended fix (“start a new chat”) matches what we’re planning: detect canned-refusal patterns in outputTranscription, don’t persist them into state, tear down the session, and re-establish with a fresh WebSocket.
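The planned mitigation in the last point can be sketched as a small gate that runs on each completed transcript before anything is written to state. This is only an illustration under stated assumptions: `REFUSAL_PATTERNS`, `is_canned_refusal`, `TurnDecision`, and `decide` are hypothetical names, and the single pattern is taken from the one refusal string observed in this incident, so other canned phrasings would need to be added.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern list; only this phrase was actually observed in the incident.
REFUSAL_PATTERNS = [
    re.compile(r"i'?m just a language model", re.IGNORECASE),
]


def is_canned_refusal(transcript: str) -> bool:
    """Return True if the transcript matches a known canned-refusal phrase."""
    # Transcripts may contain typographic apostrophes; normalize them first.
    text = transcript.replace("\u2019", "'")
    return any(p.search(text) for p in REFUSAL_PATTERNS)


@dataclass
class TurnDecision:
    persist: bool        # write this assistant turn into conversation state?
    reset_session: bool  # tear down and open a fresh WebSocket session?


def decide(transcript: str) -> TurnDecision:
    """Gate a completed assistant transcript before it reaches state."""
    if is_canned_refusal(transcript):
        # Keep the refusal out of history and start over with a clean session.
        return TurnDecision(persist=False, reset_session=True)
    return TurnDecision(persist=True, reset_session=False)
```

In use, the handler that currently calls graph.aupdate_state(...) would call `decide` first and skip the state write when `persist` is false. String matching stays brittle, which is why a structured signal from the API would be preferable if one exists.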
What I’d like to confirm or learn
- Language pinning on native audio: is a system-instruction-level constraint the only supported way to keep the model responding in English? I’ve seen conflicting guidance in Google docs vs third-party SDK wrappers.
- Safety layer introspection: when the non-adjustable safety layer fires, is there any signal in the response stream that a developer can detect (beyond string-matching the refusal)? A promptFeedback.blockReason or equivalent would let us handle it deterministically instead of pattern-matching. Does gemini-3.1-flash-live-preview emit any such signal over the Live WebSocket?
- Cellular handoff robustness: the 15-minute audio-only session cap matters here; we likely hit it during the incident session, independent of the cellular flap. Google’s docs recommend contextWindowCompression and sessionResumption for long sessions and for “switching from Wi-Fi to 5G.” Are those the primary recommended mitigations for mobile-cellular use cases, or is there a more foundational pattern I’m missing (for example, is raw WebSocket the wrong transport choice for browser clients, and should we be using WebRTC instead)?
- Session resumption + long system instructions: I saw a prior forum thread noting that sessionResumption stops working with system instructions around 200 tokens. Our system prompt is ~325 lines. Is that still a known limitation? If so, is there a recommended pattern for Live applications that need a rich system prompt: move context into tool inputs, shorten the instruction, or something else?
- Refusal recovery: is there a supported way to “reset” the Gemini safety classifier within a session (e.g., a message role or instruction that tells it to stop treating the recent refusal as context), or is tearing down and re-establishing the WebSocket session the only reliable path?
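To make the language-pinning and resumption questions concrete, here is a rough sketch of what enabling those options might look like on a raw WebSocket client. The JSON field names (`setup`, `systemInstruction`, `sessionResumption`, `contextWindowCompression`, `outputAudioTranscription`, `sessionResumptionUpdate`, `newHandle`, `resumable`, `goAway`) follow my reading of the Live API WebSocket reference and should be verified against the current docs; `build_setup`, `ResumptionTracker`, and the wording of the English constraint are hypothetical.

```python
from typing import Optional

# Assumption: raw WebSocket clients address the model with a "models/" prefix.
MODEL = "models/gemini-3.1-flash-live-preview"

# Hypothetical wording for the English constraint discussed above.
ENGLISH_PIN = (
    "Always respond in English, even if the input audio is unclear "
    "or seems to be in another language."
)


def build_setup(system_text: str, resume_handle: Optional[str] = None) -> dict:
    """Build the first client message of a Live session (field names assumed)."""
    setup = {
        "model": MODEL,
        "generationConfig": {"responseModalities": ["AUDIO"]},
        "systemInstruction": {"parts": [{"text": f"{system_text}\n\n{ENGLISH_PIN}"}]},
        "outputAudioTranscription": {},
        # Sliding-window compression, intended to let sessions outlive the cap.
        "contextWindowCompression": {"slidingWindow": {}},
        # An empty config asks the server to start sending resumption handles.
        "sessionResumption": {},
    }
    if resume_handle:
        setup["sessionResumption"] = {"handle": resume_handle}
    return {"setup": setup}


class ResumptionTracker:
    """Keeps the latest resumable handle and notes when the server asks us to leave."""

    def __init__(self) -> None:
        self.handle: Optional[str] = None
        self.should_reconnect = False

    def on_server_message(self, msg: dict) -> None:
        update = msg.get("sessionResumptionUpdate")
        if update and update.get("resumable") and update.get("newHandle"):
            self.handle = update["newHandle"]
        if "goAway" in msg:
            # The server is about to close the connection; reconnect proactively.
            self.should_reconnect = True
```

After a network drop or a `goAway`, the client would reconnect with `build_setup(system_text, tracker.handle)`. After a detected refusal it would presumably pass no handle at all, since resuming would carry the poisoned context into the new connection.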
Any pointers to undocumented patterns, model-team guidance, or community fixes would be appreciated. Happy to share more trace detail if useful; I have LangSmith captures of the full thread.
Thanks.
