Live API latency spikes

We are using the Live API with gemini-live-2.5-flash-preview, streaming audio/pcm;rate=8000 chunks in and streaming responses out. Latency sometimes spikes: time to first token reaches 7-15 seconds, measured from the end of the audio stream. Narrowing down the problem, most of the latency comes from transcription (server_content.input_transcription), which took up to 30 seconds during testing, measured from the beginning of the audio stream.
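For anyone trying to reproduce the numbers, here is a minimal sketch of how the two measurements above could be recorded per turn. The class name, event names, and report keys are all made up for illustration and are not part of the SDK; only the standard library is used.

```python
import time

# Illustrative event names (assumptions, not SDK identifiers).
EVENTS = ("stream_start", "stream_end", "first_transcription", "first_token")


class TurnLatency:
    """Records the first time each event is seen during one turn."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._marks = {}

    def mark(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown event: {event}")
        # Keep only the first occurrence, so repeated messages are ignored.
        if event not in self._marks:
            self._marks[event] = self._clock()

    def report(self):
        def delta(start, end):
            if start in self._marks and end in self._marks:
                return self._marks[end] - self._marks[start]
            return None  # one of the two events was never marked

        return {
            # Transcription latency, measured from the beginning of the stream.
            "transcription_from_stream_start_s": delta("stream_start", "first_transcription"),
            # Time to first token, measured from the end of the stream.
            "first_token_from_stream_end_s": delta("stream_end", "first_token"),
        }
```

The idea is to call mark("stream_start") before the first audio chunk is sent, mark("stream_end") after the last one, mark("first_transcription") when a message carrying server_content.input_transcription arrives, and mark("first_token") on the first response text.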

Here is the config we are using:
config = types.LiveConnectConfig(
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
            end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
            silence_duration_ms=int(silence_duration * 1000),
        ),
        turn_coverage=types.TurnCoverage.TURN_COVERAGE_UNSPECIFIED,
    ),
    response_modalities=["TEXT"],
    system_instruction=types.Content(
        parts=[types.Part.from_text(text=f"{preprompt}\n{system_instruction}")],
        role="user",
    ),
    media_resolution="MEDIA_RESOLUTION_MEDIUM",
    input_audio_transcription=dict(),
    speech_config=types.SpeechConfig(
        language_code=language_iso_code,
        voice_config=types.VoiceConfig(prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")),
    ),
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=25600,
        sliding_window=types.SlidingWindow(target_tokens=12800),
    ),
)

Hello, and welcome to the forum!

The random 15–30s latency spikes are caused by using END_SENSITIVITY_LOW on 8kHz audio. The model interprets background line static as “whispering,” so it keeps the turn open until the hard server timeout (~30s).

The Solution:

Set end-of-speech sensitivity to HIGH: end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_HIGH. This forces the model to ignore the static.

Increase Silence Duration: Set silence_duration_ms=1000. Since sensitivity is high, this 1-second buffer ensures natural pauses aren’t cut off.

Remove media_resolution: This parameter is for video/images only and should be removed from audio-only configs to keep things clean.
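Put together, the three changes could look roughly like the sketch below. It assumes the config is passed as a plain dict whose snake_case keys mirror the types.LiveConnectConfig fields and whose enum values are given by name; the helper name apply_suggested_vad_settings is made up for illustration. If you build the config with the types classes as in the question, the same three edits apply there directly.

```python
import copy


def apply_suggested_vad_settings(config, silence_duration_ms=1000):
    """Return a copy of a Live API config dict with the three suggestions applied.

    Assumption: `config` is a plain dict mirroring types.LiveConnectConfig.
    """
    updated = copy.deepcopy(config)  # leave the caller's dict untouched

    vad = (
        updated.setdefault("realtime_input_config", {})
        .setdefault("automatic_activity_detection", {})
    )
    # 1. End-of-speech sensitivity HIGH instead of LOW.
    vad["end_of_speech_sensitivity"] = "END_SENSITIVITY_HIGH"
    # 2. A 1-second silence buffer so natural pauses are not cut off.
    vad["silence_duration_ms"] = silence_duration_ms
    # 3. Drop media_resolution from the audio-only config.
    updated.pop("media_resolution", None)
    return updated


old_config = {
    "realtime_input_config": {
        "automatic_activity_detection": {
            "start_of_speech_sensitivity": "START_SENSITIVITY_HIGH",
            "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",
            "silence_duration_ms": 300,  # placeholder value for the example
        },
    },
    "response_modalities": ["TEXT"],
    "media_resolution": "MEDIA_RESOLUTION_MEDIUM",
}
new_config = apply_suggested_vad_settings(old_config)
```

Everything else in the config (system instruction, transcription, context window compression) stays as it was.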

Thanks
