Gemini Live API Issues: 1008/1011 Disconnects, Per-Session Cost, Function Calling, API Logs

Executive Summary

We are building an English speaking practice platform using Gemini Live API for real-time voice conversations. The API provides excellent audio quality and low latency. This report documents 6 issues we’ve encountered, our implemented workarounds, and requests for guidance.


Our Use Case

We use Gemini Live API (gemini-2.5-flash-native-audio-preview-12-2025) for real-time voice conversations where an AI examiner asks questions, listens to user responses, and provides spoken scoring feedback at the end.

Session Characteristics:

  • Duration: 10-14 minutes
  • Turns: 15-25 per session
  • User speaking: ~66% of session time
  • Full conversation context required for accurate scoring at session end
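For scale, the audio portion of context grows quickly at the 32 tokens/second rate quoted in the pricing section below. A back-of-envelope estimate (ours, not an official figure):

```typescript
// Back-of-envelope audio context growth for one session (our estimate).
// Assumption: 32 audio tokens/second, per the Live API pricing documentation.
const AUDIO_TOKENS_PER_SEC = 32;
const sessionSeconds = 12 * 60;                            // 12-minute session
const audioTokens = sessionSeconds * AUDIO_TOKENS_PER_SEC; // audio tokens alone
console.log(audioTokens); // 23040
```

This is before the ~2500-token system instruction, output audio, and transcription text, which is why we suspect context size is a factor in the disconnects described under Issue 1.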

Connection Configuration:

ai.live.connect({
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  config: {
    responseModalities: [Modality.AUDIO],
    temperature: 0,
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: "Aoede" } }
    },
    systemInstruction: examinerPrompt,  // ~2500 tokens
    generationConfig: {
      thinkingConfig: { thinkingBudget: 1024 }
    },
    inputAudioTranscription: {},
    outputAudioTranscription: {},
    realtimeInputConfig: {
      automaticActivityDetection: { disabled: true }  // Required for transcription flush
    },
    sessionResumption: {},
    tools: [{ functionDeclarations: scoringTools }]
  }
});

Issues Overview

#   Issue                        Status                   Request
1   Disconnects (1011/1008)      Mitigation implemented   Root cause, config guidance
2   Not Following Instructions   Three-layer workaround   toolConfig support, guidance
3   Repetition                   UI workaround            VAD control
4   Transcription Stops (30s+)   Workaround implemented   Fix underlying issue
5   Per-Session Token/Cost       No solution              Token/cost API
6   No API Logs                  Client-side logging      Enable logging

Issue 1: WebSocket Disconnects (1011/1008) - CRITICAL

What Happens

WebSocket connections close mid-conversation with codes 1008 (Policy Violation) and 1011 (Internal Error).

Error Messages Captured

From CloseEvent.reason:

// Code 1008 - Policy Violation
"Operation is not implemented, or supported, or enabled."

// Code 1011 - Internal Error
"Failed to run inference"
"Thread was cancelled"
"Thread was cancelled when writing StartStep status to channel.; Failed to close the streaming context; status = CANCELLED:"
"Internal error"
"Internal error encountered."
"RPC::DEADLINE_EXCEEDED"
"RESOURCE_EXHAUSTED"
"service is currently unavailable"

Production Example

Session disconnect at ~10 minutes:

{
  "closeCode": 1011,
  "closeReason": "Thread was cancelled when writing StartStep status to channel.; Failed to close the streaming context; status = CANCELLED:",
  "wasCleanClose": true,
  "goAwayReceivedAt": 1769009709175,
  "goAwayTimeLeftMs": 50000,
  "disconnectTimestamp": 1769009768884,
  "flushEvents": [
    {"type": "started", "timestamp": 1769009246295},
    {"type": "ended", "timestamp": 1769009247197},
    // 19 flush cycles over ~8 minutes (every 15 seconds)
    {"type": "started", "timestamp": 1769009739792},
    {"type": "ended", "timestamp": 1769009740693}
  ]
}

Observed Patterns

Pattern                 Code         Frequency
Sessions 8-12 minutes   1011         Most common
During scoring phase    1008, 1011   Occasional
Mid-conversation        1008, 1011   Occasional

Our Implementation

  1. Auto-Reconnection: Max 3 attempts with exponential backoff (1s, 2s, 4s), resumption token passed to new connection

  2. Context Loss Detection: After reconnect, detect if context was lost by checking for examiner intro phrases → trigger fallback

  3. Server-Side Scoring Fallback: Save transcript and use standard Gemini API for scoring (text-only)

  4. GoAway Handling: Listen for goAway signals and use resumption tokens
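The backoff schedule from step 1 can be sketched as a pure helper (the function name and shape are ours, not an SDK API; the real implementation also passes the resumption token into the new connect call):

```typescript
// Exponential backoff for reconnect attempts: 1s, 2s, 4s, then give up.
// Illustrative helper (ours), decoupled from the Live API session itself.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxAttempts = 3): number | null {
  if (attempt >= maxAttempts) return null; // exhausted: fall back to server-side scoring
  return baseMs * 2 ** attempt;
}

console.log([0, 1, 2, 3].map((a) => reconnectDelayMs(a))); // [1000, 2000, 4000, null]
```

Returning `null` rather than throwing makes it easy for the caller to branch into the server-side scoring fallback (step 3) once attempts are exhausted.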

Request

  1. Root Cause Documentation: What causes codes 1008 and 1011? Are there limits on session duration, context size, or token count?

  2. Configuration Guidance for 10+ Minute Sessions:

    • Recommended configurations for 7-12 minute sessions?
    • Should we proactively reconnect before the 10-minute mark?
    • Optimal frequency for activityEnd/activityStart flushes?
  3. Code 1008 Clarification: What operation triggers “Operation is not implemented, or supported, or enabled”?


Issue 2: Not Following Conversation Instructions - CRITICAL

What Happens

The model does not reliably follow conversation flow instructions:

  1. Incomplete Spoken Output: When instructed to speak all 5 band scores aloud, the model frequently stops after 1-2 scores or skips them entirely

  2. Unreliable Function Calling: The reportScoringResults function is called in only ~60-70% of sessions

  3. Premature Responses: The model sometimes provides feedback without waiting for the user's response

Our Use Case

The AI examiner must:

  1. Ask questions → wait for user response → repeat
  2. After all questions, speak all 5 band scores aloud with explanations
  3. Call reportScoringResults function to return structured data

Users require real-time spoken interaction. Only the Live API provides this capability.

Prompt Instructions

We use explicit instructions:

⚠️ CRITICAL: WAIT FOR FINAL ANSWER BEFORE CLOSING
After asking your FINAL question:
1. STOP TALKING completely
2. WAIT for the candidate's complete response (may be 30-60 seconds)
3. Only THEN proceed to the ending sequence

🚨 CRITICAL: You MUST speak ALL 5 scores below. DO NOT STOP after 1 or 2 scores.

Three-Layer Extraction Workaround

Since spoken output is not guaranteed, we extract scores through multiple fallback layers:

  1. Function Calling (reportScoringResults): Primary method
  2. Text Block Parsing: [BAND_SCORES]...[/BAND_SCORES] markers
  3. Regex Extraction: Parse from spoken transcript
// Fallback regex patterns
/(?:i\s+give\s+you|you\s+(?:get|receive))\s*(?:a|an)?\s*(\d+(?:\.\d+)?)\s*(?:for\s*)?(?:fluency)/i
/(?:overall|your overall)\s*(?:band\s*)?(?:score)\s*(?:is|would be)\s*(\d+(?:\.\d+)?)/i
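For illustration, here is the first fallback pattern applied to a typical transcript line (the sample sentence is ours):

```typescript
// Applying the fluency fallback regex to a sample transcript line.
const fluencyRe =
  /(?:i\s+give\s+you|you\s+(?:get|receive))\s*(?:a|an)?\s*(\d+(?:\.\d+)?)\s*(?:for\s*)?(?:fluency)/i;

const line = "I give you a 6.5 for fluency and coherence.";
const match = line.match(fluencyRe);
console.log(match?.[1]); // "6.5"
```

Regex extraction is the last resort precisely because phrasing varies; it only fires when both the function call and the [BAND_SCORES] markers are absent.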

Key Limitation

toolConfig.functionCallingConfig with mode: ANY is not available in the Live API:

// Standard Gemini API - supported
toolConfig: { functionCallingConfig: { mode: 'ANY' } }

// Live API - toolConfig not in LiveConnectConfig

Request

  1. toolConfig Support: Add functionCallingConfig to Live API

  2. Guidance for Instruction Following: How to improve reliability of:

    • Speaking complete requested content
    • Waiting for user response before proceeding
    • Calling functions when instructed

Issue 3: Repetition - MEDIUM

What Happens

The AI examiner sometimes repeats itself unprompted, even in quiet environments.

Context

We disable automaticActivityDetection to implement the transcription flush workaround (Issue 4). This prevents us from tuning VAD sensitivity.

Our Workaround

A UI reminder asking users to wear headphones or move to a quiet environment.

Request

  • Ability to configure VAD sensitivity while keeping manual activity control (activityEnd/activityStart)
  • OR improved transcription that removes the need for periodic flush

Issue 4: Transcription Stops During Long Speech (30s+) - WORKAROUND IMPLEMENTED

What Happens

inputTranscription events stop or degrade during continuous user speech longer than ~30 seconds.

Solution

Credit to the community:

  1. Disable automatic activity detection:
realtimeInputConfig: {
  automaticActivityDetection: { disabled: true }
}
  2. Send a periodic flush every 15 seconds:
session.send({ realtimeInput: { activityEnd: {} } });   // end the current activity segment
session.send({ realtimeInput: { activityStart: {} } }); // immediately start a new one
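A sketch of how we schedule the flush. The timer wrapper and names are ours, and the session's send method is injected so the scheduling logic can be exercised without a live connection; we model both sends as realtimeInput activity signals:

```typescript
// 15-second flush loop with an injected send callback (names are ours).
type LiveMessage = Record<string, unknown>;

function flushOnce(send: (msg: LiveMessage) => void): void {
  send({ realtimeInput: { activityEnd: {} } });   // flush pending transcription
  send({ realtimeInput: { activityStart: {} } }); // reopen activity immediately
}

function startFlushLoop(send: (msg: LiveMessage) => void, intervalMs = 15_000): () => void {
  const id = setInterval(() => flushOnce(send), intervalMs);
  return () => clearInterval(id); // call this when the session ends
}

// Dry run without a real session:
const sent: LiveMessage[] = [];
flushOnce((m) => sent.push(m));
console.log(sent.length); // 2
```

Stopping the loop on disconnect matters: a flush sent against a closed socket during reconnection would otherwise throw.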

Results

Same 2-minute continuous speech:

  • Without flush: 285 characters
  • With flush: 2,064 characters

Request

Fix the underlying transcription issue so that inputTranscription events are reliably delivered during continuous speech without requiring manual flush workarounds.


Issue 5: Per-Session Token/Cost Visibility - NEEDS FEATURE

What Happens

We cannot accurately track token consumption or calculate costs for individual Live API sessions.

Pricing (from documentation)

Component      Price per 1M tokens
Input text     $0.50
Input audio    $3.00
Output text    $2.00
Output audio   $12.00

Audio token rate: 32 tokens per second. Billing is per turn and covers the entire accumulated context window, so each new turn re-bills earlier audio and text (cumulative).
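A rough per-turn cost estimator built from this table (the helper is ours; note that because billing is cumulative, the input counts must include the full re-billed context, not just the new turn):

```typescript
// Per-turn cost estimate from the published per-1M-token prices (helper is ours).
const PRICE_PER_MTOK = { inputText: 0.5, inputAudio: 3.0, outputText: 2.0, outputAudio: 12.0 };

function turnCostUSD(t: {
  inputText: number;
  inputAudio: number;
  outputText: number;
  outputAudio: number;
}): number {
  return (
    (t.inputText * PRICE_PER_MTOK.inputText +
      t.inputAudio * PRICE_PER_MTOK.inputAudio +
      t.outputText * PRICE_PER_MTOK.outputText +
      t.outputAudio * PRICE_PER_MTOK.outputAudio) /
    1_000_000
  );
}

// Example: a turn whose input context holds 60s of user audio (60 * 32 = 1920 tokens):
console.log(turnCostUSD({ inputText: 0, inputAudio: 1920, outputText: 0, outputAudio: 0 })); // 0.00576
```

Without consistent usageMetadata (see below), the token counts feeding this helper must themselves be estimated from audio duration, which is exactly the gap we are asking to close.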

Issues

Problem                      Description
No per-session data          Cloud Console shows only aggregate usage, not a per-session breakdown
Inconsistent usageMetadata   promptTokensDetails / responseTokensDetails appear intermittently in API responses

Request

  1. Per-Session Token Consumption API: Query token usage (text + audio) for individual sessions
  2. Consistent usageMetadata: Always populate modality breakdown
  3. Cloud Console Visibility: Per-session token and cost data

Issue 6: No API Logs - NEEDS FEATURE

What Happens

Live API sessions do not appear in Gemini API Logs and Datasets in Google Cloud Console.

API Type              Visible in Logs?
Standard Gemini API   Yes
Gemini Live API       No

Impact

  • Cannot debug 1011 errors server-side
  • Cannot verify token counts
  • Cannot investigate function call failures
  • Cannot analyze audio processing issues

Our Workaround

Client-side logging to our database:

  • WebSocket close codes and reasons
  • Transcript events
  • Function call success/failure
  • Timing data
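An illustrative shape for the records we persist (field names are ours, for our own database, not any Google API):

```typescript
// Per-session log record we write client-side (shape is ours, illustrative).
interface SessionLog {
  sessionId: string;
  closeCode?: number;   // e.g. 1008 / 1011
  closeReason?: string; // CloseEvent.reason text
  events: { type: string; timestampMs: number }[]; // transcripts, function calls, flushes
}

const log: SessionLog = {
  sessionId: "session-123",
  closeCode: 1011,
  closeReason: "Failed to run inference",
  events: [{ type: "flush-started", timestampMs: 1769009246295 }],
};
console.log(JSON.parse(JSON.stringify(log)).closeCode); // 1011
```

This is workable for close codes and timing, but it cannot see anything server-side, which is why the requests below matter.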

Request

  1. Enable Live API Logging: Add to Gemini API Logs and Datasets
  2. Per-Session ID: Correlation between our sessions and Google’s internal logs
  3. Error Details: More information when 1011 occurs

Summary

#   Issue                        Our Implementation                     Request
1   Disconnects (1011/1008)      Auto-reconnection + scoring fallback   Root cause, config guidance
2   Not Following Instructions   Three-layer extraction                 toolConfig support, guidance
3   Repetition                   UI reminder                            VAD control with manual activity
4   Transcription Stops (30s+)   Periodic flush workaround              Fix underlying issue
5   Per-Session Token/Cost       N/A                                    Token API, consistent usageMetadata
6   No API Logs                  Client-side logging                    Enable Live API logging

Contact

We can provide additional logs, sample sessions, or code samples to help investigate these issues.
