Reducing latency for Gemini audio prompt requests?

Hey all,

I’m building a voice-based AI chat app, so latency is critical for the product. In theory the Live API would be perfect for this, but it has a few limitations that rule it out for me. Right now I’m treating it like any other chat app, except the user’s prompt contains audio data (usually around 5 seconds of WebM audio). There’s no audio output from Gemini, just text.

I’m finding the latency is too high for my use case. I’m using the streaming endpoint, and the time from sending the request to receiving the first streamed chunk is regularly around 1.1 s. If I replace the audio prompt with a plain text prompt, latency drops to around 400 ms, which is much closer to what I’m looking for.

I’m wondering if anyone else has encountered the same problem and if there’s anything I can do to reduce this latency?

To add some more context: I’m using gemini-2.0-flash-lite, and I’m providing a roughly 300-token system prompt with each request.
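
For reference, here’s roughly the shape of my request and how I’m measuring time to first chunk. This is a minimal sketch assuming the Python google-genai SDK; the file name, prompt, and system instruction are simplified stand-ins:

```python
import time

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

audio_bytes = open("utterance.webm", "rb").read()  # ~5 s of user audio

start = time.perf_counter()
stream = client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/webm"),
        "Respond to the user's spoken request.",
    ],
    config=types.GenerateContentConfig(
        system_instruction="...",  # ~300-token system prompt
    ),
)
for i, chunk in enumerate(stream):
    if i == 0:
        # time to first chunk: ~1.1 s with audio, ~400 ms with text only
        print(f"first chunk after {time.perf_counter() - start:.2f}s")
    print(chunk.text or "", end="")
```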

Hi @Stewart_Connor, apologies for the late response.
A few optimization techniques you can try with your current approach:

  1. Client-side VAD (Voice Activity Detection): send only the portions of audio where speech is actually present, cutting the amount of “silent” audio the model has to ingest (see the first sketch after this list).
  2. You mentioned using the streaming endpoint. Are you streaming the audio in chunks as it’s captured, or are you waiting for the entire 5 seconds to be recorded before sending the whole audio blob?
  3. While 300 tokens for a system prompt isn’t excessively large, every token adds to processing time. If some instructions don’t apply to every turn of the conversation, see whether you can trim them. That said, for context-rich interactions a well-defined system prompt is crucial.
  4. Act on the streamed output as soon as the first chunk arrives. For a voice app, that means starting TTS on the first received words, creating an interruptible and more natural conversational flow. This is about perceived latency rather than raw latency (second sketch below).
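
For point 1, here’s a rough sketch of client-side VAD using the webrtcvad package (just one option; any VAD will do). It expects 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames, so you’d run it on the raw capture before encoding to WebM; the sample rate, frame size, and aggressiveness mode below are illustrative:

```python
import webrtcvad

def keep_speech_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 30) -> bytes:
    """Drop non-speech frames from raw 16-bit mono PCM before encoding/upload."""
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most aggressive)
    frame_len = sample_rate * frame_ms // 1000 * 2  # bytes per frame (2 bytes/sample)
    voiced = []
    for i in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[i:i + frame_len]
        if vad.is_speech(frame, sample_rate):
            voiced.append(frame)
    return b"".join(voiced)
```

Less audio in means fewer audio tokens to process, which should shave some time off that first chunk.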
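
And for point 4, a sketch of feeding the stream to TTS at sentence boundaries instead of waiting for the full reply. Here `speak()` is a placeholder for whatever TTS engine you use:

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

def speak(text: str) -> None:
    ...  # enqueue text with your TTS engine (placeholder)

buffer = ""
for chunk in client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=["..."],  # audio part + prompt, as in your current request
):
    buffer += chunk.text or ""
    # flush complete sentences so TTS can start speaking immediately
    while any(p in buffer for p in ".!?"):
        cut = min(buffer.find(p) for p in ".!?" if p in buffer)
        speak(buffer[: cut + 1])
        buffer = buffer[cut + 1:]
if buffer:
    speak(buffer)
```

This doesn’t reduce the 1.1 s to first chunk, but the user hears speech much sooner, which is usually what matters.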