How to get text output from gemini-2.5-flash-preview-native-audio-dialog

Hi everyone,

I’m building a language learning application that requires real-time voice interaction. I’ve extensively tested various Gemini models for our voice pipeline but am experiencing latencies that make them unsuitable for the use case. I’m hoping to get guidance on whether these are expected limitations of preview models or if I’m missing key optimizations.

Context

  • Use case: Real-time voice interaction for VR language tutoring
  • Target latency: <3 seconds total pipeline
  • Current setup: Python, google-genai SDK, FastAPI backend
  • Production requirements: Reliable, consistent performance for live users

Tested Architectures & Results

1. Classic Pipeline :white_check_mark: (Working)

Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → Google Cloud TTS
Latency: 3-4s total (0.7s + 1.5s + 0.8s)
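For reference, the working path is roughly the sketch below (simplified, not my production code; the voice name and prompts are placeholders, auth comes from environment variables, and error handling/streaming are omitted):

python

import time

from groq import Groq                  # Groq Whisper STT
from google import genai               # Gemini LLM (google-genai SDK)
from google.cloud import texttospeech  # Google Cloud TTS

groq_client = Groq()                            # GROQ_API_KEY from env
gemini_client = genai.Client()                  # GEMINI_API_KEY from env
tts_client = texttospeech.TextToSpeechClient()  # Application Default Credentials

def voice_turn(audio_bytes: bytes) -> bytes:
    t0 = time.perf_counter()

    # 1) STT: Groq Whisper (~0.7s)
    stt = groq_client.audio.transcriptions.create(
        file=("input.wav", audio_bytes),
        model="whisper-large-v3-turbo",
    )
    t1 = time.perf_counter()

    # 2) LLM: Gemini 2.5 Flash (~1.5s)
    reply = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=stt.text,
    ).text
    t2 = time.perf_counter()

    # 3) TTS: Google Cloud Text-to-Speech (~0.8s)
    tts = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=reply),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-C"),
        audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.LINEAR16),
    )
    t3 = time.perf_counter()

    print(f"STT {t1 - t0:.2f}s | LLM {t2 - t1:.2f}s | TTS {t3 - t2:.2f}s")
    return tts.audio_content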

2. Classic with Gemini TTS :cross_mark: (Too slow)

Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → gemini-2.5-flash-preview-tts
Latency: 7-8s total (0.7s + 1.5s + 4-5s TTS)
Issue: TTS alone takes 4-5s via generate_content()
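For reference, the Gemini TTS leg is essentially the call below (a simplified sketch; the tutoring prompt is made up, and pulling the audio out assumes the documented candidates/parts/inline_data layout):

python

from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Repeat after me: Wie geht es dir?",  # example tutoring line
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Raw PCM audio comes back as inline data on the first part
pcm_bytes = response.candidates[0].content.parts[0].inline_data.data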

What I’ve tried:

  • Singleton pattern for client reuse (saved 1-3s)
  • Attempted to disable AFC via the automatic_function_calling config (no effect; see the sketch after this list)
  • Logs still show: “AFC is enabled with max remote calls: 10”
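The disable attempt was essentially this (a minimal sketch, not my exact config):

python

from google.genai import types

config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
)
# Even with disable=True, the "AFC is enabled with max remote calls: 10"
# log line still shows up for the TTS requests.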

3. Push-to-Talk Gemini Live API :cross_mark: (Too slow)

Flow: Groq Whisper STT → gemini-2.5-flash-preview-native-audio-dialog
Latency: 6-11s total (0.7s + 5.6-9.3s Live API)
Implementation: Using the SDK’s aio.live.connect() with text input (see the sketch below)

Observed issues:

  • Connection setup: ~1s
  • Processing: 5.6-9.3s for responses
  • The model is marketed as “very low latency” but is showing the opposite
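For reference, the push-to-talk path boils down to roughly the sketch below (simplified; it assumes a recent google-genai where text turns go through send_client_content, and auth/error handling are omitted):

python

import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"
CONFIG = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def speak(text: str) -> bytes:
    """Send one transcribed user turn as text and collect the audio reply."""
    audio = bytearray()
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)]),
            turn_complete=True,
        )
        async for response in session.receive():
            if response.data:  # inline PCM audio chunk
                audio.extend(response.data)
    return bytes(audio)

# asyncio.run(speak("Hello! Let's practice ordering coffee."))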

4. Voice-Activated Live API :cross_mark: (Also slow)

Flow: Browser VAD → gemini-2.5-flash-preview-native-audio-dialog
Latency: Similar 5-10s delays
Note: Direct audio streaming didn’t improve latency significantly
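The voice-activated variant replaces the text turn with streamed PCM, roughly like this (a sketch assuming a recent google-genai where send_realtime_input accepts an audio Blob; the Live API expects 16-bit, 16 kHz mono PCM input):

python

from google.genai import types

async def stream_mic_audio(session, pcm_chunks):
    """Forward 16-bit, 16 kHz mono PCM chunks from the browser VAD to the model."""
    async for chunk in pcm_chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )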

Production Logs Sample

STT: Processing 60053 bytes with model whisper-large-v3-turbo
STT Result: 90 characters transcribed
STT exceeded threshold: 4.34s
SDK processed text to audio for session test-session-123
Gemini Live API exceeded threshold: 6.86s
Pipeline exceeded threshold: 11.20s

Key Questions

  1. Are these latencies expected for preview models? The 4-5s for TTS and 5-9s for Live API seem inconsistent with “low latency” claims.
  2. Is there a specialized TTS endpoint? I’m currently using the multimodal generate_content(), which seems like overkill for simple TTS. The AFC overhead appears unavoidable.
  3. Should I be using different models?
  • Is gemini-2.0-flash-live-001 more stable/faster than the 2.5 preview?
  • Are the preview models intended for production use at all?
  4. Architecture guidance: For a <3s latency requirement, should I:
  • Continue with Google Cloud TTS until Gemini models mature?
  • Try a different architecture I haven’t considered?
  • Wait for production-ready versions (any timeline?)?
  5. Known issues: I’ve found similar reports in GitHub issues #866, #865, and #335 showing ConnectionClosedError: Cannot extract voices from a non-audio request. Is this being addressed?

Technical Details

  • TTS Configuration:

python

from google.genai import types

# Passed as config= to client.models.generate_content() for the TTS model
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
        )
    ),
)
  • Environment: Python 3.13, google-genai latest, production servers

What I Need

I need to make a decision on whether to:

  1. Wait for Gemini models to improve (if so, what’s the timeline?)
  2. Look for another TTS / realtime model
  3. Implement workarounds I haven’t discovered

Any insights from the community or Google team would be greatly appreciated. I’m particularly interested in hearing from others who’ve successfully deployed Gemini audio models in production with low-latency requirements.

Thank you!

Hi everyone,

I’m building a VR language learning app that requires both audio output (for voice) and text output (for subtitles and chat history) from the Gemini native audio dialog model. However, I’m getting conflicting information about whether this is supported and how to implement it correctly.

What I Need

  • Audio output: For natural voice conversation (working :white_check_mark:)
  • Text output: For displaying subtitles in real-time and saving chat history
  • Model: gemini-2.5-flash-preview-native-audio-dialog via Live API

What I’ve Tried

Attempt 1: Dual Modalities :cross_mark:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO", "TEXT"],  # This causes errors
    speech_config=types.SpeechConfig(...)
)

Result: WebSocket error 1007 (“Cannot extract voices from a non-audio request”) or the session becomes completely unresponsive

Attempt 2: Audio Only :white_check_mark:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # Works fine
    speech_config=types.SpeechConfig(...)
)

Result: Audio works perfectly, but no text for subtitles

Conflicting Information

I’ve found conflicting documentation about text output support:

  1. Official docs state the model outputs “Text and audio, interleaved”
  2. GitHub issue #380 suggests using output_audio_transcription:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig()
)
  3. Developer reports indicate response_modalities=["AUDIO", "TEXT"] doesn’t work

Questions

  1. What is the correct way to get both audio and text output from gemini-2.5-flash-preview-native-audio-dialog?
  2. Is output_audio_transcription the official method? If so:
  • Does this provide real-time text synchronized with audio?
  • Is the text quality suitable for displaying as subtitles?
  • Can I access the full conversation text for chat history?
  3. Why does response_modalities=["AUDIO", "TEXT"] fail? Is this:
  • A bug in the SDK?
  • An API limitation?
  • Incorrect usage on my part?

Environment

  • SDK: google-genai (latest)
  • Python: 3.13

Any clarification on the proper way to get text output for subtitles and chat history would be greatly appreciated.

Thank you!

I found the correct way to get both text and audio output from gemini-2.5-flash-preview-native-audio-dialog.

:white_check_mark: Working Solution

Don’t use response_modalities=["AUDIO", "TEXT"] - this causes errors.

Instead, use output_audio_transcription:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # Audio only here
    output_audio_transcription=types.AudioTranscriptionConfig()  # This enables text
)

Then handle both outputs in your receive loop:

python

async for response in session.receive():
    # Text transcription of the model's spoken reply
    if response.server_content and response.server_content.output_transcription:
        text = response.server_content.output_transcription.text
        display_subtitles(text)  # Perfect for subtitles!

    # Audio data arrives as inline_data on the model_turn parts
    if response.server_content and response.server_content.model_turn:
        for part in response.server_content.model_turn.parts:
            if part.inline_data and part.inline_data.data:
                play_audio(part.inline_data.data)

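For completeness, here’s a minimal end-to-end sketch of how the pieces fit together (simplified, not my exact production code; play_audio and the example prompt are placeholders):

python

import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def tutor_turn(user_text: str) -> str:
    """Send one user turn, play audio as it streams in, return the transcript."""
    transcript = []
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=user_text)]),
            turn_complete=True,
        )
        async for response in session.receive():
            sc = response.server_content
            if sc and sc.output_transcription and sc.output_transcription.text:
                transcript.append(sc.output_transcription.text)  # subtitles
            if response.data:
                play_audio(response.data)  # app-specific playback
    return "".join(transcript)

Joining the transcription chunks per turn gives you the full model text to save as chat history.
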
Hope this helps others facing the same issue!