How to get text output from gemini-2.5-flash-preview-native-audio-dialog

Hi everyone,

I’m building a language learning application that requires real-time voice interaction. I’ve extensively tested various Gemini models for our voice pipeline but am experiencing latencies that make them unsuitable for the use case. I’m hoping to get guidance on whether these are expected limitations of preview models or if I’m missing key optimizations.

Context

  • Use case: Real-time voice interaction for VR language tutoring
  • Target latency: <3 seconds total pipeline
  • Current setup: Python, google-genai SDK, FastAPI backend
  • Production requirements: Reliable, consistent performance for live users

Tested Architectures & Results

1. Classic Pipeline :white_check_mark: (Working)

Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → Google Cloud TTS
Latency: 3-4s total (0.7s + 1.5s + 0.8s)
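For reference, the working path is roughly the sketch below (simplified, not my production code; the voice name and prompts are placeholders, auth comes from environment variables, and error handling/streaming are omitted):

python

import time

from groq import Groq                  # Groq Whisper STT
from google import genai               # Gemini LLM (google-genai SDK)
from google.cloud import texttospeech  # Google Cloud TTS

groq_client = Groq()                            # GROQ_API_KEY from env
gemini_client = genai.Client()                  # GEMINI_API_KEY from env
tts_client = texttospeech.TextToSpeechClient()  # Application Default Credentials

def voice_turn(audio_bytes: bytes) -> bytes:
    t0 = time.perf_counter()

    # 1) STT: Groq Whisper (~0.7s)
    stt = groq_client.audio.transcriptions.create(
        file=("input.wav", audio_bytes),
        model="whisper-large-v3-turbo",
    )
    t1 = time.perf_counter()

    # 2) LLM: Gemini 2.5 Flash (~1.5s)
    reply = gemini_client.models.generate_content(
        model="gemini-2.5-flash",
        contents=stt.text,
    ).text
    t2 = time.perf_counter()

    # 3) TTS: Google Cloud Text-to-Speech (~0.8s)
    tts = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=reply),
        voice=texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Neural2-C"),
        audio_config=texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.LINEAR16),
    )
    t3 = time.perf_counter()

    print(f"STT {t1 - t0:.2f}s | LLM {t2 - t1:.2f}s | TTS {t3 - t2:.2f}s")
    return tts.audio_content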

2. Classic with Gemini TTS :cross_mark: (Too slow)

Flow: Groq Whisper STT → Gemini 2.5 Flash LLM → gemini-2.5-flash-preview-tts
Latency: 7-8s total (0.7s + 1.5s + 4-5s TTS)
Issue: TTS alone takes 4-5s via generate_content()
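For reference, the Gemini TTS leg is essentially the call below (a simplified sketch; the tutoring prompt is made up, and pulling the audio out assumes the documented candidates/parts/inline_data layout):

python

from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-tts",
    contents="Repeat after me: Wie geht es dir?",  # example tutoring line
    config=types.GenerateContentConfig(
        response_modalities=["AUDIO"],
        speech_config=types.SpeechConfig(
            voice_config=types.VoiceConfig(
                prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
            )
        ),
    ),
)

# Raw PCM audio comes back as inline data on the first part
pcm_bytes = response.candidates[0].content.parts[0].inline_data.data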

What I’ve tried:

  • Singleton pattern for client reuse (saved 1-3s)
  • Attempted to disable AFC via the automatic_function_calling config (no effect; see the sketch after this list)
  • Logs still show: “AFC is enabled with max remote calls: 10”
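The disable attempt was essentially this (a minimal sketch, not my exact config):

python

from google.genai import types

config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    automatic_function_calling=types.AutomaticFunctionCallingConfig(disable=True),
)
# Even with disable=True, the "AFC is enabled with max remote calls: 10"
# log line still shows up for the TTS requests.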

3. Push-to-Talk Gemini Live API :cross_mark: (Too slow)

Flow: Groq Whisper STT → gemini-2.5-flash-preview-native-audio-dialog
Latency: 6-11s total (0.7s + 5.6-9.3s Live API)
Implementation: Using the SDK’s aio.live.connect() with text input (see the sketch below)

Observed issues:

  • Connection setup: ~1s
  • Processing: 5.6-9.3s for responses
  • The model is marketed as “very low latency” but is showing the opposite
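For reference, the push-to-talk path boils down to roughly the sketch below (simplified; it assumes a recent google-genai where text turns go through send_client_content, and auth/error handling are omitted):

python

import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"
CONFIG = types.LiveConnectConfig(response_modalities=["AUDIO"])

async def speak(text: str) -> bytes:
    """Send one transcribed user turn as text and collect the audio reply."""
    audio = bytearray()
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=text)]),
            turn_complete=True,
        )
        async for response in session.receive():
            if response.data:  # inline PCM audio chunk
                audio.extend(response.data)
    return bytes(audio)

# asyncio.run(speak("Hello! Let's practice ordering coffee."))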

4. Voice-Activated Live API :cross_mark: (Also slow)

Flow: Browser VAD → gemini-2.5-flash-preview-native-audio-dialog
Latency: Similar 5-10s delays
Note: Direct audio streaming didn’t improve latency significantly
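The voice-activated variant replaces the text turn with streamed PCM, roughly like this (a sketch assuming a recent google-genai where send_realtime_input accepts an audio Blob; the Live API expects 16-bit, 16 kHz mono PCM input):

python

from google.genai import types

async def stream_mic_audio(session, pcm_chunks):
    """Forward 16-bit, 16 kHz mono PCM chunks from the browser VAD to the model."""
    async for chunk in pcm_chunks:
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )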

Production Logs Sample

STT: Processing 60053 bytes with model whisper-large-v3-turbo
STT Result: 90 characters transcribed
STT exceeded threshold: 4.34s
SDK processed text to audio for session test-session-123
Gemini Live API exceeded threshold: 6.86s
Pipeline exceeded threshold: 11.20s

Key Questions

  1. Are these latencies expected for preview models? The 4-5s for TTS and 5-9s for Live API seem inconsistent with “low latency” claims.
  2. Is there a specialized TTS endpoint? I’m currently using the multimodal generate_content(), which seems like overkill for simple TTS. The AFC overhead appears unavoidable.
  3. Should I be using different models?
  • Is gemini-2.0-flash-live-001 more stable/faster than the 2.5 preview?
  • Are the preview models intended for production use at all?
  4. Architecture guidance: For a <3s latency requirement, should I:
  • Continue with Google Cloud TTS until Gemini models mature?
  • Try a different architecture I haven’t considered?
  • Wait for production-ready versions (any timeline?)?
  5. Known issues: I’ve found similar reports in GitHub issues #866, #865, and #335 showing ConnectionClosedError: Cannot extract voices from a non-audio request. Is this being addressed?

Technical Details

  • TTS Configuration:

python

from google.genai import types

# Passed as config= to client.models.generate_content() for the TTS model
config = types.GenerateContentConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name='Kore')
        )
    ),
)
  • Environment: Python 3.13, google-genai latest, production servers

What I Need

I need to make a decision on whether to:

  1. Wait for Gemini models to improve (if so, what’s the timeline?)
  2. Look for another TTS / realtime model
  3. Implement workarounds I haven’t discovered

Any insights from the community or Google team would be greatly appreciated. I’m particularly interested in hearing from others who’ve successfully deployed Gemini audio models in production with low-latency requirements.

Thank you!

Hi everyone,

I’m building a VR language learning app that requires both audio output (for voice) and text output (for subtitles and chat history) from the Gemini native audio dialog model. However, I’m getting conflicting information about whether this is supported and how to implement it correctly.

What I Need

  • Audio output: For natural voice conversation (working :white_check_mark:)
  • Text output: For displaying subtitles in real-time and saving chat history
  • Model: gemini-2.5-flash-preview-native-audio-dialog via Live API

What I’ve Tried

Attempt 1: Dual Modalities :cross_mark:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO", "TEXT"],  # This causes errors
    speech_config=types.SpeechConfig(...)
)

Result: WebSocket error 1007 (“Cannot extract voices from a non-audio request”) or the session becomes completely unresponsive

Attempt 2: Audio Only :white_check_mark:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # Works fine
    speech_config=types.SpeechConfig(...)
)

Result: Audio works perfectly, but no text for subtitles

Conflicting Information

I’ve found conflicting documentation about text output support:

  1. Official docs state the model outputs “Text and audio, interleaved”
  2. GitHub issue #380 suggests using output_audio_transcription:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig()
)
  3. Developer reports indicate response_modalities=["AUDIO", "TEXT"] doesn’t work

Questions

  1. What is the correct way to get both audio and text output from gemini-2.5-flash-preview-native-audio-dialog?
  2. Is output_audio_transcription the official method? If so:
  • Does this provide real-time text synchronized with audio?
  • Is the text quality suitable for displaying as subtitles?
  • Can I access the full conversation text for chat history?
  3. Why does response_modalities=["AUDIO", "TEXT"] fail? Is this:
  • A bug in the SDK?
  • An API limitation?
  • Incorrect usage on my part?

Environment

  • SDK: google-genai (latest)
  • Python: 3.13

Any clarification on the proper way to get text output for subtitles and chat history would be greatly appreciated.

Thank you!

I found the correct way to get both text and audio output from gemini-2.5-flash-preview-native-audio-dialog.

:white_check_mark: Working Solution

Don’t use response_modalities=["AUDIO", "TEXT"] - this causes errors.

Instead, use output_audio_transcription:

python

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],  # Audio only here
    output_audio_transcription=types.AudioTranscriptionConfig()  # This enables text
)

Then handle both outputs in your receive loop:

python

async for response in session.receive():
    # Text transcription of the model's spoken reply
    if response.server_content and response.server_content.output_transcription:
        text = response.server_content.output_transcription.text
        display_subtitles(text)  # Perfect for subtitles!

    # Audio data arrives as inline_data on the model_turn parts
    if response.server_content and response.server_content.model_turn:
        for part in response.server_content.model_turn.parts:
            if part.inline_data and part.inline_data.data:
                play_audio(part.inline_data.data)

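For completeness, here’s a minimal end-to-end sketch of how the pieces fit together (simplified, not my exact production code; play_audio and the example prompt are placeholders):

python

import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # GEMINI_API_KEY from env
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"

CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

async def tutor_turn(user_text: str) -> str:
    """Send one user turn, play audio as it streams in, return the transcript."""
    transcript = []
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=user_text)]),
            turn_complete=True,
        )
        async for response in session.receive():
            sc = response.server_content
            if sc and sc.output_transcription and sc.output_transcription.text:
                transcript.append(sc.output_transcription.text)  # subtitles
            if response.data:
                play_audio(response.data)  # app-specific playback
    return "".join(transcript)

Joining the transcription chunks per turn gives you the full model text to save as chat history.
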
Hope this helps others facing the same issue!