Hi everyone,
I’m building a language learning application that requires real-time voice interaction. I’ve extensively tested various Gemini models for our voice pipeline but am experiencing latencies that make them unsuitable for the use case. I’m hoping to get guidance on whether these are expected limitations of preview models or if I’m missing key optimizations.
Context
- Use case: Real-time voice interaction for VR language tutoring
- Target latency: <3 seconds total pipeline
- Current setup: Python, google-genai SDK, FastAPI backend
- Production requirements: Reliable, consistent performance for live users
Tested Architectures & Results
1. Classic Pipeline (Working)
Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → Google Cloud TTS
Latency: 3-4s total (0.7s + 1.5s + 0.8s)
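For reference, a stripped-down sketch of this pipeline (not the exact production code; clients are singletons created at startup, error handling and streaming omitted):

```python
import time

from groq import Groq
from google import genai
from google.cloud import texttospeech

groq_client = Groq()                            # Groq Whisper STT
gemini_client = genai.Client()                  # Gemini 2.5 Flash LLM
tts_client = texttospeech.TextToSpeechClient()  # Google Cloud TTS

def classic_pipeline(audio_bytes: bytes) -> bytes:
    t0 = time.perf_counter()

    # 1) STT via Groq Whisper (~0.7s in my tests)
    stt = groq_client.audio.transcriptions.create(
        file=("input.wav", audio_bytes),
        model="whisper-large-v3-turbo",
    )
    t1 = time.perf_counter()

    # 2) LLM via Gemini 2.5 Flash (~1.5s)
    reply = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=stt.text,
    )
    t2 = time.perf_counter()

    # 3) TTS via Google Cloud TTS (~0.8s)
    speech = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=reply.text),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.LINEAR16
        ),
    )
    t3 = time.perf_counter()

    print(f"STT {t1 - t0:.2f}s | LLM {t2 - t1:.2f}s | TTS {t3 - t2:.2f}s")
    return speech.audio_content
```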
2. Classic with Gemini TTS (Too slow)
Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → gemini-2.5-flash-preview-tts
Latency: 7-8s total (0.7s + 1.5s + 4-5s TTS)
Issue: TTS alone takes 4-5s via generate_content()
What I’ve tried:
- Singleton pattern for client reuse (saved 1-3s)
- Attempted to disable AFC via the `automatic_function_calling` config (no effect); logs still show: “AFC is enabled with max remote calls: 10”. A simplified sketch of this call is below.
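Simplified sketch of that call (same speech_config as under Technical Details, AFC disable attempt included):

```python
from google import genai
from google.genai import types

client = genai.Client()  # singleton, reused across requests

def gemini_tts(text: str) -> bytes:
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-tts",
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
                )
            ),
            # Attempted AFC disable; logs still report AFC as enabled
            automatic_function_calling=types.AutomaticFunctionCallingConfig(
                disable=True
            ),
        ),
    )
    # Audio is returned as inline PCM data on the first candidate
    return response.candidates[0].content.parts[0].inline_data.data
```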
3. Push-to-Talk Gemini Live API (Too slow)
Flow: Groq Whisper STT → gemini-2.5-flash-preview-native-audio-dialog
Latency: 6-11s total (0.7s + 5.6-9.3s Live API)
Implementation: Using the SDK's `aio.live.connect()` with text input (sketch below)
Observed issues:
- Connection setup: ~1s
- Processing: 5.6-9.3s for responses
- The model is marketed as “very low latency”, but I'm seeing the opposite
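Simplified version of the push-to-talk turn (method names follow recent google-genai releases and may differ in older SDK versions):

```python
from google import genai
from google.genai import types

client = genai.Client()

async def live_api_turn(transcript: str) -> bytes:
    config = types.LiveConnectConfig(response_modalities=["AUDIO"])
    async with client.aio.live.connect(
        model="gemini-2.5-flash-preview-native-audio-dialog",
        config=config,
    ) as session:  # connection setup: ~1s
        # Send the STT transcript as a single completed text turn
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=transcript)]),
            turn_complete=True,
        )
        # Collect the streamed audio chunks for this turn (5.6-9.3s observed)
        audio = bytearray()
        async for message in session.receive():
            if message.data is not None:
                audio.extend(message.data)
        return bytes(audio)
```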
4. Voice-Activated Live API (Also slow)
Flow: Browser VAD → gemini-2.5-flash-preview-native-audio-dialog
Latency: Similar 5-10s delays
Note: Direct audio streaming didn’t improve latency significantly
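The voice-activated variant forwards PCM chunks from the browser into the same kind of session, roughly like this (assuming the `send_realtime_input` API from recent SDK versions and 16 kHz PCM from the browser VAD):

```python
from google.genai import types

async def forward_browser_audio(session, pcm_chunks):
    # pcm_chunks: async iterator of raw 16-bit, 16 kHz PCM chunks from the browser VAD
    async for chunk in pcm_chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )
```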
Production Logs Sample
```
STT: Processing 60053 bytes with model whisper-large-v3-turbo
STT Result: 90 characters transcribed
STT exceeded threshold: 4.34s
SDK processed text to audio for session test-session-123
Gemini Live API exceeded threshold: 6.86s
Pipeline exceeded threshold: 11.20s
```
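The “exceeded threshold” lines come from simple per-stage timing around each call; a minimal version of that helper (names are illustrative, not from any SDK):

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")

@contextmanager
def timed_stage(name: str, threshold_s: float):
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if elapsed > threshold_s:
            logger.warning("%s exceeded threshold: %.2fs", name, elapsed)

# e.g. wrap each stage:
# with timed_stage("Gemini Live API", threshold_s=3.0):
#     ...  # run the Live API turn
```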
Key Questions
- Are these latencies expected for preview models? The 4-5s for TTS and 5-9s for Live API seem inconsistent with “low latency” claims.
- Is there a specialized TTS endpoint? I'm currently using the multimodal `generate_content()`, which seems like overkill for simple TTS, and the AFC overhead appears unavoidable.
- Should I be using different models?
  - Is `gemini-2.0-flash-live-001` more stable/faster than the 2.5 preview?
  - Are the preview models intended for production use at all?
- Architecture guidance: For the <3s latency requirement, should I:
- Continue with Google Cloud TTS until Gemini models mature?
- Try a different architecture I haven’t considered?
- Wait for production-ready versions (any timeline?)?
- Known issues: I've found similar reports in GitHub issues #866, #865, and #335 showing `ConnectionClosedError: Cannot extract voices from a non-audio request`. Is this being addressed?
Technical Details
- TTS Configuration:

```python
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
        )
    ),
)
```
- Environment: Python 3.13, latest google-genai SDK, production servers
What I Need
I need to make a decision on whether to:
- Wait for Gemini models to improve (if so, what’s the timeline?)
- Look for another TTS / realtime model
- Implement workarounds I haven’t discovered
Any insights from the community or Google team would be greatly appreciated. I'm particularly interested in hearing from others who've successfully deployed Gemini audio models with low-latency requirements.
Thank you!