We are using the Live API with gemini-live-2.5-flash-preview, sending a stream of audio/pcm;rate=8000 audio chunks and receiving streaming responses. Latency sometimes spikes, and the time to first token rises to 7-15 seconds (measured from the end of the audio stream). Narrowing down the problem, most of the latency comes from transcription (server_content.input_transcription), which took up to 30 seconds during testing (measured from the beginning of the audio stream).
Here is the config we are using:
from google.genai import types

# silence_duration (in seconds), preprompt, system_instruction and
# language_iso_code are defined elsewhere in our code.
config = types.LiveConnectConfig(
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(
            start_of_speech_sensitivity=types.StartSensitivity.START_SENSITIVITY_HIGH,
            end_of_speech_sensitivity=types.EndSensitivity.END_SENSITIVITY_LOW,
            silence_duration_ms=int(silence_duration * 1000),
        ),
        turn_coverage=types.TurnCoverage.TURN_COVERAGE_UNSPECIFIED,
    ),
    response_modalities=["TEXT"],
    system_instruction=types.Content(
        parts=[types.Part.from_text(text=f"{preprompt}\n{system_instruction}")],
        role="user",
    ),
    media_resolution="MEDIA_RESOLUTION_MEDIUM",
    input_audio_transcription=dict(),
    speech_config=types.SpeechConfig(
        language_code=language_iso_code,
        voice_config=types.VoiceConfig(prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Puck")),
    ),
    context_window_compression=types.ContextWindowCompressionConfig(
        trigger_tokens=25600,
        sliding_window=types.SlidingWindow(target_tokens=12800),
    ),
)
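
For anyone trying to reproduce this, here is a minimal sketch of how the two latencies mentioned above can be measured. `LatencyTracker` is a hypothetical helper written for this post, not part of the SDK. It assumes that the first message carrying `server_content.model_turn` marks the first token, and that the first message carrying `server_content.input_transcription` marks the first transcription. The fake messages below are stand-ins for real server messages so that the snippet runs without a connection.

```python
import time
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Callable, Optional


@dataclass
class LatencyTracker:
    """Hypothetical helper: records when key events happen in one turn."""

    clock: Callable[[], float] = time.monotonic
    stream_start: Optional[float] = None
    stream_end: Optional[float] = None
    first_transcription: Optional[float] = None
    first_token: Optional[float] = None

    def mark_stream_start(self) -> None:
        # Call right before sending the first audio chunk.
        self.stream_start = self.clock()

    def mark_stream_end(self) -> None:
        # Call right after sending the last audio chunk.
        self.stream_end = self.clock()

    def observe(self, message) -> None:
        # Inspect one server message and record first-seen timestamps.
        content = getattr(message, "server_content", None)
        if content is None:
            return
        now = self.clock()
        if (self.first_transcription is None
                and getattr(content, "input_transcription", None) is not None):
            self.first_transcription = now
        # Assumption: model output arrives in server_content.model_turn.
        if (self.first_token is None
                and getattr(content, "model_turn", None) is not None):
            self.first_token = now

    def time_to_first_token(self) -> Optional[float]:
        # Measured from the end of the audio stream.
        if self.first_token is None or self.stream_end is None:
            return None
        return self.first_token - self.stream_end

    def time_to_first_transcription(self) -> Optional[float]:
        # Measured from the beginning of the audio stream.
        if self.first_transcription is None or self.stream_start is None:
            return None
        return self.first_transcription - self.stream_start


# Demo with a fake clock and fake messages (no network involved).
ticks = iter([0.0, 2.0, 2.5, 9.0])
tracker = LatencyTracker(clock=lambda: next(ticks))
tracker.mark_stream_start()   # t = 0.0
tracker.mark_stream_end()     # t = 2.0
tracker.observe(SimpleNamespace(server_content=SimpleNamespace(
    input_transcription=SimpleNamespace(text="hi"), model_turn=None)))  # t = 2.5
tracker.observe(SimpleNamespace(server_content=SimpleNamespace(
    input_transcription=None, model_turn=SimpleNamespace(parts=[]))))   # t = 9.0
print(tracker.time_to_first_transcription(), tracker.time_to_first_token())
```

In a real session the same tracker would presumably be fed every message coming out of the receive loop, with `mark_stream_start()` and `mark_stream_end()` called around the audio sending code. Using a monotonic clock avoids skew from wall-clock adjustments during long tests.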