Executive Summary
We are building an English speaking practice platform using Gemini Live API for real-time voice conversations. The API provides excellent audio quality and low latency. This report documents 6 issues we’ve encountered, our implemented workarounds, and requests for guidance.
Our Use Case
We use Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) for real-time voice conversations where an AI examiner asks questions, listens to user responses, and provides spoken scoring feedback at the end.
Session Characteristics:
- Duration: 10-14 minutes
- Turns: 15-25 per session
- User speaking: ~66% of session time
- Full conversation context required for accurate scoring at session end
Connection Configuration:
ai.live.connect({
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  config: {
    responseModalities: [Modality.AUDIO],
    temperature: 0,
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: "Aoede" } }
    },
    systemInstruction: examinerPrompt, // ~2500 tokens
    generationConfig: {
      thinkingConfig: { thinkingBudget: 1024 }
    },
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    realtimeInputConfig: {
      automaticActivityDetection: { disabled: true } // Required for transcription flush
    },
    sessionResumption: {},
    tools: [{ functionDeclarations: scoringTools }]
  }
});
Issues Overview
| # | Issue | Status | Request |
|---|---|---|---|
| 1 | Disconnects (1011/1008) | Mitigation implemented | Root cause, config guidance |
| 2 | Not Following Instructions | Three-layer workaround | toolConfig support, guidance |
| 3 | Repetition | UI workaround | VAD control |
| 4 | Transcription Stops (30s+) | Workaround implemented | Fix underlying issue |
| 5 | Per-Session Token/Cost | No solution | Token/cost API |
| 6 | No API Logs | Client-side logging | Enable logging |
Issue 1: WebSocket Disconnects (1011/1008) - CRITICAL
What Happens
WebSocket connections close mid-conversation with codes 1008 (Policy Violation) and 1011 (Internal Error).
Error Messages Captured
From CloseEvent.reason:
// Code 1008 - Policy Violation
"Operation is not implemented, or supported, or enabled."
// Code 1011 - Internal Error
"Failed to run inference"
"Thread was cancelled"
"Thread was cancelled when writing StartStep status to channel.; Failed to close the streaming context; status = CANCELLED:"
"Internal error"
"Internal error encountered."
"RPC::DEADLINE_EXCEEDED"
"RESOURCE_EXHAUSTED"
"service is currently unavailable"
Production Example
Session disconnect at ~10 minutes:
{
  "closeCode": 1011,
  "closeReason": "Thread was cancelled when writing StartStep status to channel.; Failed to close the streaming context; status = CANCELLED:",
  "wasCleanClose": true,
  "goAwayReceivedAt": 1769009709175,
  "goAwayTimeLeftMs": 50000,
  "disconnectTimestamp": 1769009768884,
  "flushEvents": [
    {"type": "started", "timestamp": 1769009246295},
    {"type": "ended", "timestamp": 1769009247197},
    // 19 flush cycles over ~8 minutes (every 15 seconds)
    {"type": "started", "timestamp": 1769009739792},
    {"type": "ended", "timestamp": 1769009740693}
  ]
}
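To make the timing in this record explicit, here is a small sketch that computes how long after the GoAway notice the socket actually closed. The field names are taken from our log record above; the `DisconnectLog` type and the `goAwayOverrunMs` helper are illustrative names of our own, not part of any SDK.

```typescript
// Subset of the client-side log record shown above (illustrative type).
interface DisconnectLog {
  closeCode: number;
  goAwayReceivedAt: number;    // epoch ms when the goAway message arrived
  goAwayTimeLeftMs: number;    // time left as advertised by the goAway message
  disconnectTimestamp: number; // epoch ms when the close event fired
}

// Milliseconds between the advertised deadline and the actual close.
// Positive: the socket outlived the deadline. Negative: it closed early.
function goAwayOverrunMs(log: DisconnectLog): number {
  const deadline = log.goAwayReceivedAt + log.goAwayTimeLeftMs;
  return log.disconnectTimestamp - deadline;
}

const example: DisconnectLog = {
  closeCode: 1011,
  goAwayReceivedAt: 1769009709175,
  goAwayTimeLeftMs: 50000,
  disconnectTimestamp: 1769009768884,
};

const elapsedMs = example.disconnectTimestamp - example.goAwayReceivedAt;
console.log(elapsedMs);                // 59709
console.log(goAwayOverrunMs(example)); // 9709
```

In this session the close arrived about 59.7 seconds after the GoAway notice, roughly 9.7 seconds past the advertised 50 seconds.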
Observed Patterns
| Pattern | Code | Frequency |
|---|---|---|
| Sessions 8-12 minutes | 1011 | Most common |
| During scoring phase | 1008, 1011 | Occasional |
| Mid-conversation | 1008, 1011 | Occasional |
Our Implementation
- Auto-Reconnection: Max 3 attempts with exponential backoff (1s, 2s, 4s), resumption token passed to new connection
- Context Loss Detection: After reconnect, detect if context was lost by checking for examiner intro phrases → trigger fallback
- Server-Side Scoring Fallback: Save transcript and use standard Gemini API for scoring (text-only)
- GoAway Handling: Listen for goAway signals and use resumption tokens
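The reconnection policy can be sketched roughly as follows. This is a simplified illustration, not our production code: `Connect` is a hypothetical callback standing in for whatever opens a new Live session with the saved resumption token, and we assume the delay is applied before each attempt.

```typescript
// Delay schedule for exponential backoff: base, 2*base, 4*base, ...
function backoffDelaysMs(maxAttempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < maxAttempts; i++) {
    delays.push(baseMs * Math.pow(2, i));
  }
  return delays;
}

// Hypothetical connector: opens a new session, passing the last resumption
// token if one was received. It should reject when the attempt fails.
type Connect<S> = (resumptionToken: string | undefined) => Promise<S>;

async function reconnectWithBackoff<S>(
  connect: Connect<S>,
  resumptionToken: string | undefined,
  maxAttempts: number = 3,
  baseMs: number = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<S> {
  let lastError: unknown = undefined;
  for (const delayMs of backoffDelaysMs(maxAttempts, baseMs)) {
    await sleep(delayMs); // assumption: wait 1s, 2s, 4s before attempts 1, 2, 3
    try {
      return await connect(resumptionToken);
    } catch (err) {
      lastError = err; // remember the failure and move on to the next delay
    }
  }
  throw new Error(
    `Reconnect failed after ${maxAttempts} attempts: ${String(lastError)}`,
  );
}
```

The `sleep` parameter is injectable so the schedule can be exercised in tests without real waiting. When all attempts fail, the caller would fall through to the server-side scoring fallback.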
Request
- Root Cause Documentation: What causes codes 1008 and 1011? Are there limits on session duration, context size, or token count?
- Configuration Guidance for 10+ Minute Sessions:
  - Recommended configurations for 7-12 minute sessions?
  - Should we proactively reconnect before the 10-minute mark?
  - Optimal frequency for activityEnd/activityStart flushes?
- Code 1008 Clarification: What operation triggers “Operation is not implemented, or supported, or enabled”?
Issue 2: Not Following Conversation Instructions - CRITICAL
What Happens
The model does not reliably follow conversation flow instructions:
- Incomplete Spoken Output: When instructed to speak 5 band scores aloud, the model frequently stops after 1-2 scores or skips them entirely
- Function Calling: The reportScoringResults function is called in only ~60-70% of sessions
- Premature Responses: The model sometimes provides feedback before waiting for the user's response
Our Use Case
The AI examiner must:
- Ask questions → wait for user response → repeat
- After all questions, speak all 5 band scores aloud with explanations
- Call the reportScoringResults function to return structured data
Users require real-time spoken interaction. Only the Live API provides this capability.
Prompt Instructions
We use explicit instructions:
⚠️ CRITICAL: WAIT FOR FINAL ANSWER BEFORE CLOSING
After asking your FINAL question:
1. STOP TALKING completely
2. WAIT for the candidate's complete response (may be 30-60 seconds)
3. Only THEN proceed to the ending sequence
🚨 CRITICAL: You MUST speak ALL 5 scores below. DO NOT STOP after 1 or 2 scores.
Three-Layer Extraction Workaround
Since spoken output is not guaranteed, we extract scores through multiple fallback layers:
- Function Calling (reportScoringResults): Primary method
- Text Block Parsing: [BAND_SCORES]...[/BAND_SCORES] markers
- Regex Extraction: Parse from spoken transcript
// Fallback regex patterns
/(?:i\s+give\s+you|you\s+(?:get|receive))\s*(?:a|an)?\s*(\d+(?:\.\d+)?)\s*(?:for\s*)?(?:fluency)/i
/(?:overall|your overall)\s*(?:band\s*)?(?:score)\s*(?:is|would be)\s*(\d+(?:\.\d+)?)/i
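A minimal sketch of how the three layers chain together is shown below. The two regexes are the fallback patterns quoted above. Everything else is an assumption for illustration: the `Scores` shape, the function names, and in particular the `name: value` line format inside the `[BAND_SCORES]` block.

```typescript
type Scores = Record<string, number>;

// Layer 2: parse "name: value" lines between the markers.
// The line format inside the block is assumed for this example.
function parseBandScoresBlock(text: string): Scores | null {
  const block = /\[BAND_SCORES\]([\s\S]*?)\[\/BAND_SCORES\]/.exec(text);
  if (!block) return null;
  const scores: Scores = {};
  const linePattern = /^\s*([A-Za-z_ ]+?)\s*:\s*(\d+(?:\.\d+)?)\s*$/gm;
  let m: RegExpExecArray | null;
  while ((m = linePattern.exec(block[1])) !== null) {
    scores[m[1].toLowerCase()] = parseFloat(m[2]);
  }
  return Object.keys(scores).length > 0 ? scores : null;
}

// Layer 3: the fallback patterns, applied to the spoken transcript.
const FLUENCY_PATTERN =
  /(?:i\s+give\s+you|you\s+(?:get|receive))\s*(?:a|an)?\s*(\d+(?:\.\d+)?)\s*(?:for\s*)?(?:fluency)/i;
const OVERALL_PATTERN =
  /(?:overall|your overall)\s*(?:band\s*)?(?:score)\s*(?:is|would be)\s*(\d+(?:\.\d+)?)/i;

function parseTranscript(transcript: string): Scores | null {
  const scores: Scores = {};
  const fluency = FLUENCY_PATTERN.exec(transcript);
  if (fluency) scores["fluency"] = parseFloat(fluency[1]);
  const overall = OVERALL_PATTERN.exec(transcript);
  if (overall) scores["overall"] = parseFloat(overall[1]);
  return Object.keys(scores).length > 0 ? scores : null;
}

type Extraction = {
  source: "function_call" | "text_block" | "regex";
  scores: Scores;
};

// Try each layer in order and report which one produced the result.
function extractScores(
  functionCallArgs: Scores | null, // layer 1: reportScoringResults arguments, if the call happened
  modelText: string,
  transcript: string,
): Extraction | null {
  if (functionCallArgs) return { source: "function_call", scores: functionCallArgs };
  const block = parseBandScoresBlock(modelText);
  if (block) return { source: "text_block", scores: block };
  const spoken = parseTranscript(transcript);
  if (spoken) return { source: "regex", scores: spoken };
  return null;
}
```

Recording the `source` alongside the scores lets us measure how often each layer is actually needed.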
Key Limitation
toolConfig.functionCallingConfig with mode: ANY is not available in Live API:
// Standard Gemini API - supported
toolConfig: { functionCallingConfig: { mode: 'ANY' } }
// Live API - toolConfig not in LiveConnectConfig
Request
- toolConfig Support: Add functionCallingConfig to Live API
- Guidance for Instruction Following: How to improve reliability of:
  - Speaking complete requested content
  - Waiting for user response before proceeding
  - Calling functions when instructed
Issue 3: Repetition - MEDIUM
What Happens
The AI examiner sometimes repeats itself unprompted, even in quiet environments.
Context
We disable automaticActivityDetection to implement the transcription flush workaround (Issue 4). This prevents us from tuning VAD sensitivity.
Our Workaround
A UI reminder asks users to wear headphones or move to a quiet environment.
Request
- Ability to configure VAD sensitivity while keeping manual activity control (activityEnd/activityStart)
- OR improved transcription that removes the need for periodic flush
Issue 4: Transcription Stops During Long Speech (30s+) - SOLVED
What Happens
inputTranscription events stop or degrade during continuous user speech longer than ~30 seconds.
Solution
Credit to the community:
- Disable automatic activity detection:
realtimeInputConfig: {
automaticActivityDetection: { disabled: true }
}
- Send periodic flush every 15 seconds:
session.send({ clientContent: { turnComplete: false } }); // activityEnd
session.send({ realtimeInput: { activityStart: {} } }); // activityStart
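The scheduling around those two messages might look like the sketch below. `ActivityControls` is a hypothetical interface: its two callbacks stand in for whatever calls send the activityEnd and activityStart messages in your client. The sketch sends them back to back and leaves out any delay between the two.

```typescript
// Hypothetical minimal interface wrapping the two flush messages.
interface ActivityControls {
  sendActivityEnd(): void;
  sendActivityStart(): void;
}

// One flush cycle: close the current activity, then open a new one.
function flushOnce(controls: ActivityControls): void {
  controls.sendActivityEnd();
  controls.sendActivityStart();
}

// Run a flush every intervalMs while the user is speaking.
// Returns a function that stops the timer.
function startPeriodicFlush(
  controls: ActivityControls,
  intervalMs: number = 15000,
): () => void {
  const timer = setInterval(() => flushOnce(controls), intervalMs);
  return () => clearInterval(timer);
}
```

A caller would start the timer when the user begins speaking and invoke the returned stop function when the turn ends or the socket closes, so no flush is sent on a dead connection.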
Results
Transcript length for the same 2-minute continuous speech sample:
- Without flush: 285 characters
- With flush: 2,064 characters
References
- icapora/gemini-workshop, TRANSCRIPTION_LIMITS.md (GitHub, master branch)
- Gemini Live API: Delays or Missing input_audio_transcription Events (forum thread, post #5 by icapora)
Request
Fix the underlying transcription issue so that inputTranscription events are reliably delivered during continuous speech without requiring manual flush workarounds.
Issue 5: Per-Session Token/Cost Visibility - NEEDS FEATURE
What Happens
We cannot accurately track token consumption or calculate costs for individual Live API sessions.
Pricing (from documentation)
| Component | Price per 1M tokens |
|---|---|
| Input text | $0.50 |
| Input audio | $3.00 |
| Output text | $2.00 |
| Output audio | $12.00 |
Audio token rate: 32 tokens per second. Billing is per turn and covers the entire context window, so cost accumulates as the session grows.
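For lack of real per-session data, the best we can do is a client-side estimate along the following lines. This is a rough sketch under stated assumptions, not a statement of how billing actually works: we assume every turn is billed for the system prompt plus all audio already in the context, and that earlier model audio counts at the input audio rate. The `Turn` and `Estimate` types and the function name are our own.

```typescript
const AUDIO_TOKENS_PER_SECOND = 32;
// Prices per 1M tokens, from the table above.
const USD_PER_MILLION = { inputText: 0.5, inputAudio: 3.0, outputAudio: 12.0 };

interface Turn {
  userAudioSec: number;  // seconds of user speech in this turn
  modelAudioSec: number; // seconds of model speech in this turn
}

interface Estimate {
  inputTextTokens: number;
  inputAudioTokens: number;
  outputAudioTokens: number;
  usd: number;
}

function estimateSessionCost(turns: Turn[], systemPromptTokens: number): Estimate {
  let contextAudioTokens = 0; // audio already sitting in the context window
  let inputTextTokens = 0;
  let inputAudioTokens = 0;
  let outputAudioTokens = 0;

  for (const turn of turns) {
    const userTokens = turn.userAudioSec * AUDIO_TOKENS_PER_SECOND;
    const modelTokens = turn.modelAudioSec * AUDIO_TOKENS_PER_SECOND;
    // Assumption: each turn is billed for the prompt plus everything so far.
    inputTextTokens += systemPromptTokens;
    inputAudioTokens += contextAudioTokens + userTokens;
    outputAudioTokens += modelTokens;
    contextAudioTokens += userTokens + modelTokens;
  }

  const usd =
    (inputTextTokens * USD_PER_MILLION.inputText +
      inputAudioTokens * USD_PER_MILLION.inputAudio +
      outputAudioTokens * USD_PER_MILLION.outputAudio) /
    1000000;
  return { inputTextTokens, inputAudioTokens, outputAudioTokens, usd };
}
```

Under these assumptions the billed input grows roughly quadratically with the number of turns, which is why the estimate is sensitive to whether the cumulative model is right. Consistent `usageMetadata` would let us replace the guesswork with measured values.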
Issues
| Problem | Description |
|---|---|
| No per-session data | Cloud Console shows only aggregate usage, not per-session breakdown |
| Inconsistent usageMetadata | promptTokensDetails / responseTokensDetails appear intermittently in API responses |
Request
- Per-Session Token Consumption API: Query token usage (text + audio) for individual sessions
- Consistent usageMetadata: Always populate modality breakdown
- Cloud Console Visibility: Per-session token and cost data
Issue 6: No API Logs - NEEDS FEATURE
What Happens
Live API sessions do not appear in Gemini API Logs and Datasets in Google Cloud Console.
| API Type | Visible in Logs? |
|---|---|
| Standard Gemini API | Yes |
| Gemini Live API | No |
Impact
- Cannot debug 1011 errors server-side
- Cannot verify token counts
- Cannot investigate function call failures
- Cannot analyze audio processing issues
Our Workaround
Client-side logging to our database:
- WebSocket close codes and reasons
- Transcript events
- Function call success/failure
- Timing data
Request
- Enable Live API Logging: Add to Gemini API Logs and Datasets
- Per-Session ID: Correlation between our sessions and Google’s internal logs
- Error Details: More information when 1011 occurs
Summary
| # | Issue | Our Implementation | Request |
|---|---|---|---|
| 1 | Disconnects (1011/1008) | Auto-reconnection + scoring fallback | Root cause, config guidance |
| 2 | Not Following Instructions | Three-layer extraction | toolConfig support, guidance |
| 3 | Repetition | UI reminder | VAD control with manual activity |
| 4 | Transcription Stops (30s+) | Periodic flush workaround | Fix underlying issue |
| 5 | Per-Session Token/Cost | N/A | Token API, consistent usageMetadata |
| 6 | No API Logs | Client-side logging | Enable Live API logging |
Contact
We can provide additional logs, sample sessions, or code samples to help investigate these issues.