No. Access to the Imagen models isn’t through the AI Studio API. You’ll need to use the Vertex AI API.
While Gemini accepts audio input, it doesn’t produce audio output. You’ll need to use the Google Cloud Text-to-Speech (TTS) API for that.
Most of the latency I tend to notice is in the LLM portion itself, not in the STT or TTS portions. What latency numbers are you getting for each?
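If you haven’t split the numbers out yet, here is a rough sketch of one way to time each stage separately. The `fake_stt`, `fake_llm`, and `fake_tts` functions are placeholders I made up for illustration, not real API calls; swap in your own functions that call the STT service, Gemini, and the TTS service.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(stage: Callable[[Any], Any], payload: Any) -> Tuple[Any, float]:
    """Run one stage and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = stage(payload)
    return result, time.perf_counter() - start


def run_pipeline(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> Dict[str, float]:
    """Run STT -> LLM -> TTS in order and record the time spent in each."""
    timings: Dict[str, float] = {}
    text, timings["stt"] = timed(stt, audio)
    reply, timings["llm"] = timed(llm, text)
    _, timings["tts"] = timed(tts, reply)
    return timings


# Hypothetical stand-ins for the real services (assumptions, not real APIs).
def fake_stt(audio: bytes) -> str:
    time.sleep(0.01)
    return "hello"


def fake_llm(prompt: str) -> str:
    time.sleep(0.03)
    return "hi there"


def fake_tts(text: str) -> bytes:
    time.sleep(0.01)
    return b"\x00" * 16


if __name__ == "__main__":
    for name, seconds in run_pipeline(b"raw-audio", fake_stt, fake_llm, fake_tts).items():
        print(f"{name}: {seconds * 1000:.1f} ms")
```

Running it a handful of times and comparing the three numbers should show which stage dominates. If you stream the LLM response, time-to-first-token is probably the more useful number than total time.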