Reducing latency for Gemini audio prompt requests?

Hey all,

I’m building a voice-based AI chat app, so latency is critical for the product. In theory the Live API would be perfect for this, but it has a few limitations that rule it out for me. Right now I’m treating it like any other chat app, except the user’s prompt contains audio data (usually around 5 seconds of WebM audio). There’s no audio output from Gemini, just text.

I’m finding the latency is too high for my use case. I’m using the streaming endpoint, and the time from sending the request to receiving the first streamed chunk is regularly around 1.1 s. If I replace the audio prompt with a plain text prompt, latency drops to around 400 ms, which is much closer to what I’m looking for.

I’m wondering if anyone else has encountered the same problem and if there’s anything I can do to reduce this latency?

To add some more context: I’m using gemini-2.0-flash-lite, and I’m providing a roughly 300-token system prompt with each request.
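
For reference, here’s roughly the shape of my request and how I’m measuring time to first chunk. This is a minimal sketch assuming the Python google-genai SDK; the file name, prompt, and system instruction are simplified stand-ins:

```python
import time

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

audio_bytes = open("utterance.webm", "rb").read()  # ~5 s of user audio

start = time.perf_counter()
stream = client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/webm"),
        "Respond to the user's spoken request.",
    ],
    config=types.GenerateContentConfig(
        system_instruction="...",  # ~300-token system prompt
    ),
)
for i, chunk in enumerate(stream):
    if i == 0:
        # time to first chunk: ~1.1 s with audio, ~400 ms with text only
        print(f"first chunk after {time.perf_counter() - start:.2f}s")
    print(chunk.text or "", end="")
```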

Hi @Stewart_Connor, apologies for the late response.
A few optimization techniques you can try with your current approach:

  1. Client-side VAD (Voice Activity Detection): send only the portions of audio where speech is actually present, cutting the amount of “silent” audio the model has to ingest (see the first sketch after this list).
  2. You mentioned using the streaming endpoint. Are you streaming the audio in chunks as it’s captured, or are you waiting for the entire 5 seconds to be recorded before sending the whole audio blob?
  3. While 300 tokens for a system prompt isn’t excessively large, every token adds to processing time. If some instructions don’t apply to every turn of the conversation, see whether you can trim them. That said, for context-rich interactions a well-defined system prompt is crucial.
  4. Act on the streamed output as soon as the first chunk arrives. For a voice app, that means starting TTS on the first received words, creating an interruptible and more natural conversational flow. This is about perceived latency rather than raw latency (second sketch below).
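
For point 1, here’s a rough sketch of client-side VAD using the webrtcvad package (just one option; any VAD will do). It expects 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames, so you’d run it on the raw capture before encoding to WebM; the sample rate, frame size, and aggressiveness mode below are illustrative:

```python
import webrtcvad

def keep_speech_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    """Drop non-speech frames from raw 16-bit mono PCM before encoding/upload."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most aggressive)
    frame_len = sample_rate * frame_ms // 1000 * 2  # bytes per frame (2 bytes/sample)
    voiced = []
    for i in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[i:i + frame_len]
        if vad.is_speech(frame, sample_rate):
            voiced.append(frame)
    return b"".join(voiced)
```

Less audio in means fewer audio tokens to process, which should shave some time off that first chunk.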
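
And for point 4, a sketch of feeding the stream to TTS at sentence boundaries instead of waiting for the full reply. Here `speak()` is a placeholder for whatever TTS engine you use:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def speak(text: str) -> None:
    ...  # enqueue text with your TTS engine (placeholder)

buffer = ""
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=["..."],  # audio part + prompt, as in your current request
):
    buffer += chunk.text or ""
    # flush complete sentences so TTS can start speaking immediately
    while any(p in buffer for p in ".!?"):
        cut = min(buffer.find(p) for p in ".!?" if p in buffer)
        speak(buffer[: cut + 1])
        buffer = buffer[cut + 1:]
if buffer:
    speak(buffer)
```

This doesn’t reduce the 1.1 s to first chunk, but the user hears speech much sooner, which is usually what matters.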