I’m experiencing unexpected behavior with input_audio_transcription in the Live API (gemini-live-2.5-flash-preview; I also tried gemini-2.5-flash-preview-native-audio-dialog). When a user speaks for an extended period (say >15 seconds), the input_transcription stream stops sending chunks after approximately 100 chunks, even though:
1. Audio is still being captured and sent to the Live API
2. The AI continues to process all audio correctly - when I ask the AI about content spoken after transcription stopped, it recalls everything accurately
3. Only the transcription stream stops - not the underlying audio processing. Because the input transcript stalls, there is a perceived delay before the AI responds.
The output transcription works perfectly.
Observed Behavior:
- Transcription chunks flow normally for the first ~100 chunks (~10 seconds)
- Then response.server_content.input_transcription stops appearing in responses
- Audio continues to be processed (proven by the AI’s accurate responses)
- After the user finishes speaking, the AI responds correctly using ALL spoken content
Environment
- Model: gemini-live-2.5-flash-preview (I also tried gemini-2.5-flash-preview-native-audio-dialog; the same thing happened)
- Audio Format: PCM 16-bit, 16 kHz mono
- Audio Chunk Size: ~5 KB per WebSocket send (buffered from 8x 1024-sample chunks)
Code Sample
Configuration:
config = {
    "response_modalities": ["AUDIO"],
    "system_instruction": base_prompt,  # ~200 lines
    "output_audio_transcription": {},
    "input_audio_transcription": {},
}

async with client.aio.live.connect(
    model="gemini-live-2.5-flash-preview", config=config
) as session:
    # ... audio input/output handlers
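For context, the input side looks roughly like this (a simplified sketch, not my exact code: audio_queue, session_active, and SEND_THRESHOLD are placeholders for my capture pipeline, and I send audio with send_realtime_input from the google-genai SDK):

from google.genai import types

async def handle_audio_input():
    buffer = bytearray()
    while session_active:
        # Raw PCM 16-bit, 16 kHz mono chunks from the capture side
        chunk = await audio_queue.get()
        buffer.extend(chunk)
        # Buffer several capture chunks into one WebSocket send (~5 KB)
        if len(buffer) >= SEND_THRESHOLD:
            await session.send_realtime_input(
                audio=types.Blob(data=bytes(buffer), mime_type="audio/pcm;rate=16000")
            )
            buffer.clear()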
Audio Output Handler (receives responses from AI):
async def handle_audio_output():
    while session_active:
        turn = session.receive()
        async for response in turn:
            # AI's audio output - works fine
            if response.data:
                await websocket.send(response.data)
            # AI's transcription - works fine
            if response.server_content and response.server_content.output_transcription:
                ai_text = response.server_content.output_transcription.text
                logger.info(f"[AI] {ai_text}")
            # USER's transcription - STOPS after ~100 chunks
            if response.server_content and response.server_content.input_transcription:
                user_text = response.server_content.input_transcription.text
                transcription_chunk_count += 1
                logger.info(f"[USER] Chunk #{transcription_chunk_count}: {user_text}")
            # Stops logging after chunk ~100, but audio keeps flowing
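To rule out the chunks merely being delayed or batched rather than dropped, I can timestamp that branch, roughly like this (the time import and last_input_ts bookkeeping are additions just for this diagnostic, not part of the handler above):

import time

last_input_ts = time.monotonic()

# inside the input_transcription branch above:
now = time.monotonic()
gap = now - last_input_ts
logger.info(f"[USER] Chunk #{transcription_chunk_count} (+{gap:.2f}s): {user_text}")
last_input_ts = now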
Observed Log Pattern:
[USER] Chunk #1: " A"
[USER] Chunk #2: " use"
[USER] Chunk #3: " Go"
…
[USER] Chunk #98: " of"
[USER] Chunk #99: " fun"
[USER] Chunk #100: " for"
[USER] Chunk #101: " us"
# … then nothing, even though user continues speaking for 20+ more seconds
# AI responds accurately to ALL content including post-chunk-100 speech
I wonder whether this is expected behavior of the Live API or whether I’m doing something wrong in my code. It feels like there’s a magic knob somewhere that would fix this issue with a simple turn. Any suggestions / help are greatly appreciated.