@Logan_Kilpatrick @Mustan_lokhand
Hi team,
I’m seeing consistently high response latency with the Gemini Live API in a production-like voice-simulation setup.
Use case
We use Gemini as a simulated user/caller to test a target voice agent (the target agent speaks first).
The flow is: Twilio Media Stream → Server → Gemini Live WS → Server → UI.
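For context, Twilio Media Streams deliver 8 kHz, 8-bit μ-law mono audio base64-encoded in media events, so the server up-converts to 16-bit / 16 kHz mono before forwarding. A simplified sketch of that conversion (it uses the stdlib audioop module, which is deprecated and removed in Python 3.13, so treat it as illustrative only):

```python
import audioop  # stdlib; deprecated since 3.11, removed in 3.13 - illustrative only
import base64
import json


class TwilioToGeminiAudio:
    """Converts Twilio Media Stream frames (8 kHz mu-law) to 16-bit, 16 kHz, mono PCM."""

    def __init__(self) -> None:
        self._ratecv_state = None  # resampler state carried across frames

    def convert(self, twilio_message: str) -> bytes | None:
        event = json.loads(twilio_message)
        if event.get("event") != "media":
            return None  # ignore start/mark/stop events
        mulaw = base64.b64decode(event["media"]["payload"])
        pcm_8k = audioop.ulaw2lin(mulaw, 2)  # mu-law -> 16-bit linear PCM, still 8 kHz
        pcm_16k, self._ratecv_state = audioop.ratecv(
            pcm_8k, 2, 1, 8000, 16000, self._ratecv_state
        )
        return pcm_16k  # ready to send as audio/pcm;rate=16000
```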
Model
gemini-2.5-flash-native-audio-preview-12-2025
What we are already doing (per docs / best practices)
- Input audio sent as 16-bit PCM, 16 kHz, mono (audio/pcm;rate=16000)
- Output audio handled at 24 kHz
- Audio sent in 20 ms chunks (within the recommended 20–40 ms)
- Long-lived WebSocket session (no reconnect per turn)
- Ordered ingest / serialized dispatch (no parallel frame processing)
- Minimal client buffering (small startup buffer only)
- Local/manual VAD with explicit activityStart / activityEnd (see the sketch after this list)
- Manual silence threshold of around 510 ms
- We log:
  - activityEnd requested
  - activityEnd sent
  - queueDrainMs
  - geminiProcessingMs
  - first response audio timestamp
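Simplified sketch of how a single simulated-caller turn is framed with manual VAD. It is shown with the google-genai Python SDK for brevity (the raw Live WebSocket messages carry the same fields); pcm_chunks is a placeholder for the drained 20 ms frames:

```python
from google import genai
from google.genai import types

MODEL = "gemini-2.5-flash-native-audio-preview-12-2025"

client = genai.Client()  # API key read from the environment

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Automatic VAD is disabled because turns are driven by our own local VAD.
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(disabled=True)
    ),
)


async def send_turn(session, pcm_chunks):
    """Frame one caller turn: activityStart, 20 ms PCM chunks, activityEnd."""
    await session.send_realtime_input(activity_start=types.ActivityStart())
    for chunk in pcm_chunks:  # each chunk = 20 ms of 16-bit, 16 kHz, mono PCM (640 bytes)
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )
    await session.send_realtime_input(activity_end=types.ActivityEnd())


async def run_turn(pcm_chunks) -> bytes:
    """Send one turn and collect the model's 24 kHz PCM reply."""
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await send_turn(session, pcm_chunks)
        audio_out = bytearray()
        async for message in session.receive():
            sc = message.server_content
            if sc and sc.model_turn:
                for part in sc.model_turn.parts or []:
                    if part.inline_data:
                        audio_out.extend(part.inline_data.data)  # 24 kHz PCM response audio
            if sc and sc.turn_complete:
                break
        return bytes(audio_out)
```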
Observed behavior
- End-to-end perceived latency is often 5–6 s, sometimes higher on the first turn
- In many turns the local queue drain is now low (often ~100–300 ms), but geminiProcessingMs is often ~2.2–3.8 s and sometimes ~5.4 s (how these are measured is sketched below)
- Example first turn from logs:
  - queuedChunksAtActivityEnd: 83
  - queueDrainMs: 1654ms
  - geminiProcessingMs: 2245ms
- Another run showed geminiProcessingMs around 5423 ms even with a very low queue drain
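For clarity on what these numbers mean, the two metrics are computed roughly like this (a simplified sketch of the instrumentation; measure_turn is an illustrative helper and the SDK calls mirror the earlier sketch):

```python
import time

from google.genai import types


async def measure_turn(session, queued_chunks):
    """Returns (queue_drain_ms, gemini_processing_ms) for one simulated-caller turn."""
    t_end_requested = time.monotonic()  # local VAD decided the caller stopped speaking

    # queueDrainMs = time to flush the 20 ms frames still queued locally
    # before activityEnd can go out.
    for chunk in queued_chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )
    await session.send_realtime_input(activity_end=types.ActivityEnd())
    t_end_sent = time.monotonic()

    # geminiProcessingMs = activityEnd sent -> first response audio received.
    t_first_audio = None
    async for message in session.receive():
        sc = message.server_content
        if sc and sc.model_turn and any(p.inline_data for p in sc.model_turn.parts or []):
            t_first_audio = time.monotonic()
            break

    queue_drain_ms = (t_end_sent - t_end_requested) * 1000
    gemini_processing_ms = (t_first_audio - t_end_sent) * 1000 if t_first_audio else None
    return queue_drain_ms, gemini_processing_ms
```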
Questions
- Is this latency profile expected for this model in a speech-to-speech proxy architecture?
- Are there known server-side factors (region, context growth, session duration, model settings) that cause 5 s+ spikes?
- Is there any recommended tuning for manual VAD + telephony streams beyond what we already do?