Gemini 3.1 Flash Live Preview, cascade failure on mobile cellular (5G↔4G handoff): language switch, persistent canned refusal, session lockup

We’re using gemini-3.1-flash-live-preview via raw WebSocket (generativelanguage.googleapis.com) for a marine diagnostic voice assistant. A single real-world session on Android Chrome over Verizon — with the network flapping between 5G and 4G, as happens on a boat offshore — triggered what looks like a cluster of related failures. I’d like to understand what’s actually happening and whether there’s a supported mitigation path.

Environment

  • Model: gemini-3.1-flash-live-preview

  • Transport: raw WebSocket (not the SDK wrapper)

  • Client: React web app running in an Android Chrome WebView (Capacitor)

  • Network: Verizon cellular, observed handoff between 5G and 4G during the session

  • Session length at failure: ~30 minutes

  • sessionResumption: not enabled

  • contextWindowCompression: not enabled

  • safety_settings: not explicitly set (relying on defaults)

  • speech_config.language_code: not set (per docs, native audio models auto-detect)

  • System instruction: ~325 lines

What happened, in order

  1. Session occasionally would not initialize on first connection attempt; a page refresh resolved it. (I suspect this is an AudioContext user-gesture issue on our side and am noting it only for completeness.)

  2. Around the 10–15 minute mark, mid-session, the model’s TTS output switched from English to Portuguese (pt-BR). The user had not spoken Portuguese. Audio input during that period was degraded by the cellular handoff — I assume silent frames or packet loss caused the native audio model’s language detector to reclassify.

  3. A few turns later, the user said an engine model number (“6LPASTP”) aloud. The model responded verbatim with “I’m just a language model and can’t help with that.” — Gemini’s own canned refusal, not anything in our codebase. We confirmed via grep that this string does not exist anywhere in our backend or frontend.

  4. That refusal then persisted for 7 consecutive turns, regardless of the user’s input — including simple greetings and attempts to re-engage in Portuguese. The session was effectively locked.

  5. Our LangSmith traces show the refusal was written into our LangGraph conversation state via graph.aupdate_state(...) from the Gemini outputTranscription stream. Once it was in state, subsequent turns saw “a recent assistant refusal” as recent history, which compounds the stuck pattern.

What I think is happening (please correct me)

  • The language switch is documented behavior — native audio models auto-detect, and we’re not constraining language in the system instruction. Fix: add an explicit English constraint in system_instruction. I believe this is the supported path since speech_config.language_code isn’t honored on native audio. Confirming?

  • The canned refusal appears to come from Google’s non-adjustable core safety layer (not the four adjustable harm categories, which default to OFF per the Gemini 2.5/3.x docs). If that’s correct, safety_settings = BLOCK_NONE wouldn’t help here. Is that right?

  • The self-reinforcement isn’t the API’s fault — that’s our architecture writing Gemini’s output into our own state and re-injecting it as history. The community-recommended fix (“start a new chat”) matches what we’re planning: detect canned-refusal patterns in outputTranscription, don’t persist them into state, tear down the session, and re-establish with a fresh WebSocket.
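A minimal sketch of the refusal gate described in the last bullet, assuming it runs on each completed outputTranscription turn before anything is passed to graph.aupdate_state(...). The names `RefusalGate`, `is_canned_refusal`, and the `max_consecutive` threshold are hypothetical, and the only pattern listed is the one refusal string quoted in this post; other canned strings would need to be added as they are observed.

```python
import re
import unicodedata

# Assumption: this is the only canned refusal we have seen so far (quoted above).
CANNED_REFUSAL_PATTERNS = (
    re.compile(r"i'?m just a language model and can'?t help with that"),
)


def normalize(text: str) -> str:
    """Lowercase, straighten curly apostrophes, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def is_canned_refusal(transcript: str) -> bool:
    """True if the normalized transcript contains a known canned refusal."""
    normalized = normalize(transcript)
    return any(p.search(normalized) for p in CANNED_REFUSAL_PATTERNS)


class RefusalGate:
    """Decides, per completed turn, whether to persist it and whether to reconnect."""

    def __init__(self, max_consecutive: int = 2) -> None:
        # Hypothetical knob: how many refusals in a row before we tear down.
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def on_turn(self, transcript: str) -> tuple[bool, bool]:
        """Return (persist_to_state, tear_down_session) for one turn."""
        if is_canned_refusal(transcript):
            self.consecutive += 1
            # Never write the refusal into conversation state.
            return False, self.consecutive >= self.max_consecutive
        self.consecutive = 0
        return True, False
```

With `max_consecutive=1` the gate tears down on the first refusal, which matches the plan as stated; a higher value tolerates a single stray refusal before reconnecting. String matching remains a stopgap until there is a deterministic signal in the stream.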

What I’d like to confirm or learn

  1. Language pinning on native audio: is system-instruction-level constraint the only supported way to keep the model responding in English? I’ve seen conflicting guidance in Google docs vs third-party SDK wrappers.

  2. Safety layer introspection: when the non-adjustable safety layer fires, is there any signal in the response stream that a developer can detect (beyond string-matching the refusal)? A promptFeedback.blockReason or equivalent would let us handle it deterministically instead of pattern-matching. Does gemini-3.1-flash-live-preview emit any such signal over the Live WebSocket?

  3. Cellular handoff robustness: the 15-minute audio-only session cap matters here — we likely hit it during the incident session, independent of the cellular flap. Google’s docs recommend contextWindowCompression and sessionResumption for long sessions and for “switching from Wi-Fi to 5G.” Are those the primary recommended mitigations for mobile-cellular use cases, or is there a more foundational pattern I’m missing (for example, is raw WebSocket the wrong transport choice for browser clients, and should we be using WebRTC instead)?

  4. Session resumption + long system instructions: I saw a prior forum thread noting that sessionResumption stops working with system instructions around 200 tokens. Our system prompt is ~325 lines. Is that still a known limitation? If so, is there a recommended pattern for Live applications that need a rich system prompt — move context into tool inputs, shorten the instruction, or something else?

  5. Refusal recovery: is there a supported way to “reset” the Gemini safety classifier within a session (e.g., a message role or instruction that tells it to stop treating the recent refusal as context), or is tearing down and re-establishing the WebSocket session the only reliable path?
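For reference, this is roughly what we would send and track if we enable the options from questions 1, 3, and 4. It is a sketch under assumptions: the field names (`sessionResumption.handle`, `contextWindowCompression.slidingWindow`, `outputAudioTranscription`, `sessionResumptionUpdate.newHandle` / `resumable`, `goAway`) follow my reading of the Live API reference and should be verified for this preview model; the wording of `ENGLISH_PIN` is our own guess, not documented guidance; audio and generation config are omitted. It does not answer whether resumption works with a long system instruction.

```python
import json
from typing import Optional

MODEL = "models/gemini-3.1-flash-live-preview"
# Assumption: our own phrasing for the language constraint.
ENGLISH_PIN = "Always respond in English, regardless of the language you detect in the audio."


def build_setup_message(system_instruction: str, handle: Optional[str] = None) -> str:
    """Build the first WebSocket frame for a new or resumed session."""
    # An empty object requests resumption updates; a handle resumes a prior session.
    resumption = {"handle": handle} if handle else {}
    setup = {
        "model": MODEL,
        "systemInstruction": {
            "parts": [{"text": f"{ENGLISH_PIN}\n\n{system_instruction}"}]
        },
        "sessionResumption": resumption,
        "contextWindowCompression": {"slidingWindow": {}},
        "outputAudioTranscription": {},
    }
    return json.dumps({"setup": setup})


class ResumptionTracker:
    """Remembers the latest resumable handle and whether the server sent goAway."""

    def __init__(self) -> None:
        self.handle: Optional[str] = None
        self.go_away = False

    def on_server_message(self, raw: str) -> None:
        msg = json.loads(raw)
        update = msg.get("sessionResumptionUpdate")
        # Only keep handles the server marks as resumable.
        if update and update.get("resumable") and update.get("newHandle"):
            self.handle = update["newHandle"]
        if "goAway" in msg:
            self.go_away = True
```

On a dropped socket (for example during a 5G/4G handoff) or after `goAway`, the client would open a new WebSocket and send `build_setup_message(prompt, tracker.handle)` as the first frame.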

Any pointers to undocumented patterns, model-team guidance, or community fixes would be appreciated. Happy to share more trace detail if useful; I have LangSmith captures of the full thread.

Thanks.

We are now evaluating the use of Gemma4 to mitigate this issue. We will be testing this hypothesis shortly, and I have attached our performance expectations.

We’re seeing a similar session-lockup pattern on `gemini-3.1-flash-live-preview` in production, different trigger though: fixed-line/Twilio audio (not mobile handoff). Sharing in case it’s the same underlying bug.

**Symptoms (matches your “session lockup” item):**

  • After caller’s short German utterance, no further response emitted from server even after `silence_duration_ms` elapses

  • No `serverContent.turnComplete` or any frame back

  • Subsequent synthetic context injection (LLMContextFrame with assistant+user pair) is **ignored**; Gemini stays in stuck state

  • Caller-side audio fully captured (verified post-call via separate STT batch on the same mp3)
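A rough sketch of how the "no frame back" symptom could be detected client-side, assuming the pipeline can report when the caller stops speaking and when any server frame arrives. `StallWatchdog` and `grace_ms` are hypothetical names and values; this is independent of Pipecat's own APIs and only shows the timing logic.

```python
import time
from typing import Callable, Optional


class StallWatchdog:
    """Flags a session as stalled when no server frame follows the end of user speech."""

    def __init__(
        self,
        silence_duration_ms: int = 2500,
        grace_ms: int = 5000,  # hypothetical allowance for model latency
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.deadline_s = (silence_duration_ms + grace_ms) / 1000.0
        self.clock = clock
        self.waiting_since: Optional[float] = None

    def on_user_speech_end(self) -> None:
        # Start (or restart) the timer when local VAD reports the caller stopped talking.
        self.waiting_since = self.clock()

    def on_server_frame(self) -> None:
        # Any frame from the server (audio, transcription, turnComplete) clears the timer.
        self.waiting_since = None

    def is_stalled(self) -> bool:
        # Stalled only if we are waiting and the deadline has passed.
        if self.waiting_since is None:
            return False
        return self.clock() - self.waiting_since > self.deadline_s
```

Polling `is_stalled()` on a short interval gives a deterministic trigger for teardown instead of waiting for the caller to hang up.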

**Pipeline:**

  • Twilio inbound → Pipecat (`pipecat-ai`) → `GeminiLiveLLMService`

  • Hardcoded VAD: start=HIGH, end=HIGH, silence_duration_ms=2500

  • behavior_config.eagerness=normal

  • prompt size ~30k tokens (system_prompt + per-intent flow body)

  • One-greeting cached audio path (no initial assistant turn in LLMContext)

**Questions for Google folks / community:**

1. Has the cascade-failure pattern from this thread been triaged / acknowledged internally? Any ETA on fix?

2. Is **session teardown + reconnect** the recommended escape from a stuck session, or is there a less drastic API (e.g. force `turnComplete=true` server-side)?

3. Does seeding the assistant’s first turn into `LLMContext` history (instead of skipping when greeting plays from cache) reduce stuck-session probability? We currently skip to avoid a double-greeting; trade-off seems suboptimal.

4. Has anyone confirmed whether `gemini-2.0-flash-live-001` GA exhibits the same lockup, or only the `3.1-flash-live-preview`? Considering a fallback path.

5. Is there a public metric / status indicator that captures fine-grained Live-API behavior shifts? `aistudio.google.com/status` shows green even when we observe ~17% silence baseline.
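Until questions 2 and 4 are answered, this is the kind of reconnect schedule we are considering: capped exponential backoff with jitter, switching to the fallback model after a few failed attempts. It is only a sketch; `reconnect_plan` and all numeric defaults are hypothetical, and whether the fallback model avoids the lockup is exactly what question 4 asks.

```python
import random
from typing import Iterator, Optional, Tuple

PRIMARY_MODEL = "gemini-3.1-flash-live-preview"
FALLBACK_MODEL = "gemini-2.0-flash-live-001"


def reconnect_plan(
    max_attempts: int = 6,
    fallback_after: int = 3,
    base_s: float = 0.5,
    cap_s: float = 8.0,
    rng: Optional[random.Random] = None,
) -> Iterator[Tuple[str, float]]:
    """Yield (model, delay_seconds) for each reconnect attempt."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        # Exponential ceiling, capped so a long outage does not grow the wait unbounded.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: pick a random delay between zero and the ceiling.
        delay = rng.uniform(0.0, ceiling)
        model = PRIMARY_MODEL if attempt < fallback_after else FALLBACK_MODEL
        yield model, delay
```

The caller would sleep for `delay`, open a fresh WebSocket with `model`, and stop iterating on the first successful setup.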