Reducing latency for Gemini audio prompt requests?

Hey all,

I’m building a voice-based AI chat app, so latency is critical for the product. In theory the Live API would be perfect for this, but it has a few limitations that mean I can’t go with that approach. Right now I’m treating it like any other chat app, except the user’s prompt contains audio data (usually around 5 seconds of WebM audio). There’s no audio output from Gemini, just text.

The latency I’m seeing is too high for my use case. I’m using the streaming endpoint, and I’m regularly getting around 1.1 s from the time the request is sent to the time the first streamed chunk comes back. If I replace the user’s audio prompt with a plain text prompt, the latency drops to around 400 ms, which is much closer to what I’m aiming for.
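Here’s roughly how I’m measuring that time to first chunk. This is a simplified sketch using the google-genai Python SDK; the API key, file name, and prompt text are just placeholders:

```python
import time

from google import genai
from google.genai import types

client = genai.Client(api_key="MY_API_KEY")  # placeholder key

# ~5 seconds of recorded WebM audio from the user (placeholder file name)
with open("user_turn.webm", "rb") as f:
    audio_bytes = f.read()

start = time.perf_counter()

# Same shape as a normal chat turn, just with an inline audio part
stream = client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/webm"),
        "Respond to what the user said.",  # placeholder text part
    ],
)

# Block until the first streamed chunk arrives
next(iter(stream))
print(f"time to first chunk: {time.perf_counter() - start:.3f}s")
```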

I’m wondering if anyone else has encountered the same problem, and whether there’s anything I can do to reduce this latency.

To add some more context: I’m using gemini-2.0-flash-lite, and I’m sending a system prompt of around 300 tokens with each request.
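For completeness, the system prompt goes in with each request via the config. Again just a sketch, and the instruction text here is a short stand-in for my real ~300-token prompt:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="MY_API_KEY")  # placeholder key

# Stand-in for the real ~300-token system prompt sent with every request
SYSTEM_PROMPT = "You are a helpful voice assistant. Keep replies short."

with open("user_turn.webm", "rb") as f:  # placeholder file name
    audio_bytes = f.read()

stream = client.models.generate_content_stream(
    model="gemini-2.0-flash-lite",
    contents=[
        types.Part.from_bytes(data=audio_bytes, mime_type="audio/webm"),
        "Respond to what the user said.",  # placeholder text part
    ],
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
)

# Consume the rest of the stream as it arrives
for chunk in stream:
    if chunk.text:
        print(chunk.text, end="", flush=True)
```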