So… I’m using the 2.0 exp model because it’s currently the only way to use the Live API with realtime STT plus a separate TTS solution. I just got notified that 2.0 is going away, which locks you into the Google pre-trained voices and audio-only output. Does anyone have a solution I’m not seeing here? Using Cartesia.ai along with the 2.0 Gemini Live exp model has been a phenomenal experience…
Hello Seth! We are not sunsetting the Flash 2.0 models for a few months, and we will provide an alternative solution for using the Live API with realtime STT and a separate TTS.
Could you please let me know your full set of requirements? Thank you
Hi Alisa,
Thank you for reaching out! Here’s our full set of requirements for the Gemini
Live API:
Our Architecture: TEXT Modality + External TTS
We use what we call a “half-cascade” architecture:
1. Gemini Live API handles real-time STT and conversation (TEXT modality only)
2. External TTS (Cartesia, or our custom CosyVoice-based system) handles
speech synthesis with our persona voices
This architecture is critical for us because:
- We have a marketplace of unique AI persona voices that require custom TTS (voice cloning + 21 emotion controls)
- Gemini’s native audio output doesn’t support the voice customization we need
- We need fine-grained prosody control (speed, pitch, pauses, emotional arcs)
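The half-cascade flow can be sketched as a minimal pipeline. Every name below is an illustrative placeholder, not a call from any real SDK:

```python
# Sketch of the half-cascade: Live API text output -> director layer -> external TTS.
# All function names are illustrative placeholders, not real SDK calls.

def run_director(text: str) -> str:
    """Stand-in for the director layer: annotate text before synthesis."""
    return f"<speak>{text}</speak>"  # e.g. wrap in SSML for the TTS engine

def half_cascade_turn(llm_text_chunk: str, tts_send) -> None:
    """Route one streamed text chunk from the Live API through the director
    to an external TTS sender (Cartesia, CosyVoice, ...)."""
    tts_send(run_director(llm_text_chunk))

spoken = []
half_cascade_turn("Hello there.", spoken.append)
```

The key property is the ordering: text is always available to application code *between* the LLM and the TTS engine.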
Why Separation Matters: The “Director” Layer
This is the most important part of our architecture.
Our codebase acts as a director in a play - orchestrating what happens between
understanding (STT) and speaking (TTS). Combining STT and TTS into one black
box would eliminate the intelligence layer that makes our AI feel human.
Here’s what happens in our “director” layer:
| Capability | What It Does | Why It Needs TEXT |
|---|---|---|
| Humanization Engine | Adds natural speech patterns, hesitations, self-corrections | Modifies text before TTS |
| Emotional SSML Injection | Tags text with SSML emotion markup based on context | Needs text to annotate |
| Memory Prosody | Adds pauses when referencing past conversations ("Your dad’s passing...") | Analyzes text for memory triggers |
| Meaningful Silence | Decides when NOT to speak (grief, processing) | Intercepts before TTS |
| Tool Execution | Runs functions mid-conversation, injects results | Must happen between LLM and TTS |
| Persona Switching | Hands off to different AI team members mid-session | Changes voice/personality |
| Backchanneling | Injects “mm-hmm”, “I see” at natural moments | Timing-aware text injection |
| Circadian Adaptation | Adjusts speed/warmth based on time of day | Modifies prosody parameters |
| Relationship Stage | Voice evolves from formal → intimate over months | Long-term state affects TTS |
| Context Builders | 45+ modules inject guidance (emotion, memory, calendar, etc.) | Shape LLM output before TTS |
| Anticipation Pipeline | Prepares response prosody during user speech | Pre-computes TTS parameters |
| Sanitization | Strips leaked JSON, fixes pronunciation, cleans SSML | Must process text before TTS |
Our product IS the director layer. The LLM provides understanding; our code
provides humanity.
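As an illustration, two of the simpler director passes (sanitization and humanization) can be written as pure text transforms. These are toy versions we wrote for this post, not our production code:

```python
import re

def sanitize(text: str) -> str:
    """Toy sanitization pass: strip leaked JSON blobs and stray markup tags,
    then collapse whitespace, before the text reaches TTS."""
    text = re.sub(r"\{[^{}]*\}", "", text)          # drop inline {...} blobs
    text = re.sub(r"</?[a-zA-Z][^>]*>", "", text)   # drop stray tags
    return " ".join(text.split())

def humanize(text: str) -> str:
    """Toy humanization pass: prepend a hesitation to long responses."""
    return "Hmm... " + text if len(text) > 60 else text

raw = 'Sure. {"tool":"calendar"} I can <emphasis>help</emphasis> with that.'
clean = humanize(sanitize(raw))
```

The real passes are far richer, but the point is the same: they all take text in and emit text out, which is only possible with a TEXT-modality LLM response.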
Specific Features We Rely On
| Feature | Our Usage |
|---|---|
| TEXT modality | All responses come as text, routed through our director layer to TTS |
| inputAudioTranscription | We need UserInputTranscribed events for our tool routing system |
| Built-in VAD/Turn Detection | Fast server-side turn detection (~100-200ms) |
| Streaming text output | Low-latency text chunks for our processing pipeline |
| Native function calling | Backup layer for tool execution |
What We Need in Any Alternative Solution
1. TEXT-only output mode - We cannot use combined audio output; we need text
to run through our director layer
2. Input audio transcription events - Real-time transcripts during user speech
3. Server-side turn detection - Fast VAD built into the API
4. Streaming support - Low latency is critical for natural conversation
5. Function calling - Native or text-based
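Taken together, these five requirements amount to an event contract. Any replacement API would need to drive callbacks shaped roughly like this (an illustrative interface of our own; the method names are not Google's):

```python
from abc import ABC, abstractmethod

class LiveSessionEvents(ABC):
    """Callbacks any alternative Live API must be able to drive.
    Illustrative interface only; method names are ours, not from any SDK."""

    @abstractmethod
    def on_input_transcription(self, partial_text: str) -> None:
        """Realtime transcript of user speech (req. 2) -> tool routing."""

    @abstractmethod
    def on_turn_complete(self) -> None:
        """Server-side VAD decided the user finished speaking (req. 3)."""

    @abstractmethod
    def on_text_chunk(self, chunk: str) -> None:
        """Streamed TEXT-modality output (reqs. 1 and 4) -> director -> TTS."""

    @abstractmethod
    def on_function_call(self, name: str, args: dict) -> None:
        """Native function calling (req. 5)."""

class _Recorder(LiveSessionEvents):
    """Tiny concrete implementation that just logs event order."""
    def __init__(self):
        self.log = []
    def on_input_transcription(self, partial_text):
        self.log.append(("stt", partial_text))
    def on_turn_complete(self):
        self.log.append(("turn", None))
    def on_text_chunk(self, chunk):
        self.log.append(("text", chunk))
    def on_function_call(self, name, args):
        self.log.append(("call", name))

demo = _Recorder()
demo.on_input_transcription("book a table")
demo.on_turn_complete()
demo.on_text_chunk("Sure, I can")
```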
Feature Requests: Making Gemini’s Native Audio Competitive
We’ve evaluated alternatives like FlashLabs Chroma (open-source real-time
dialogue with voice cloning) and Cartesia.ai. Here’s what would make Gemini’s
native audio viable for production voice AI:
| Feature | FlashLabs Chroma | Cartesia | What We Need from Gemini |
|---|---|---|---|
| Voice cloning | Reference audio prompts | 10-20 sec upload | Short reference audio → instant voice |
| Emotion control | Not documented | 21 emotions | Tag-based or instruction-based emotions |
| SSML/Prosody | Not documented | Speed, pitch, breaks | Accept SSML in audio generation |
| Streaming latency | Real-time (unspecified) | ~200-300ms first byte | ≤200ms first audio byte |
| TEXT intermediary | End-to-end only | N/A (TTS only) | Critical: TEXT output for our director layer |
The ideal Gemini Live API would offer:
1. TEXT modality (keep current!) - For our director layer processing
2. Voice cloning - Upload 10-20 sec reference → custom voice ID
3. Emotion tags in audio output - [happy], [sympathetic], [calm] applied to
native audio
4. SSML support in audio - Our director layer outputs SSML; native audio
should respect it
5. Hybrid mode - TEXT + Audio output simultaneously (we process text, user
hears audio)
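To make the hybrid-mode request concrete: today a Live session is configured with a single response modality; what we are asking for is the combined form below. The `"TEXT"`-plus-`"AUDIO"` combination is our feature request, not a currently supported configuration, and the dict shape here is a simplified sketch rather than an exact SDK type:

```python
# Current, supported shape: one response modality per Live session.
text_only_config = {"response_modalities": ["TEXT"]}

# Requested hybrid mode (NOT currently supported): the application receives
# streamed text for its director layer while the end user hears native audio.
hybrid_config = {"response_modalities": ["TEXT", "AUDIO"]}
```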
Why this matters:
The open-source community (FlashLabs Chroma, CosyVoice, Kokoro) is rapidly
adding voice cloning and personalization. Cartesia has become the de facto
choice for production voice AI because of emotion control + voice cloning. If
Gemini’s native audio supported these features, it could become a true
end-to-end solution - but only if TEXT output remains available for
applications like ours that need the “director” layer.
What Would NOT Work for Us
- Any model that ONLY supports audio-to-audio (no TEXT intermediary)
- Any model that doesn’t support TEXT modality with audio input
- Losing the inputAudioTranscription capability
- Pre-trained voices only (we need custom persona voices)
Our Current Stack
- Model: gemini-2.0-flash-exp
- Client: @livekit/agents-plugin-google
- Director Layer: millions of lines of code supporting humanization, context builders, and SSML processing
- TTS: Cartesia.ai (with superhuman capabilities layer)
- Transport: LiveKit (WebRTC)
We’re happy to provide more technical details, share code examples, or hop on
a call to discuss. We’ve been very happy with the current 2.0 exp model and
want to ensure continuity.
Best,
Seth
I am seeing that models/gemini-2.0-flash-exp has been removed and is no longer working as of today. Can you confirm?
It is showing up in Vertex but not AI Studio, and the Live API does not work in Vertex. Please restore the model to AI Studio; you all pulled it last night!!!
Correction: gemini-2.0-flash-live-preview-04-09 is now the only model that supports STT, it is only available in Vertex, and it feels a lot slower than the exp model. Can we please get a version of the exp model back, even if it’s bad at tool calling and doesn’t have audio integration? And can we get it in AI Studio rather than Vertex? If you could also provide a realtime TTS model that supports SSML, that would be amazing! @Alisa_Fortin let me know who I need to talk to…
Hello Alisa,
We also have a strong requirement for a model that can use the Live API with text output, so that we can continue to rely on external TTS solutions. This point is critical for us.
At the moment, the audio output from the Gemini Live API still has reliability issues, such as frequent misreading of numbers and similar errors. Because of this, the quality and controllability provided by external TTS systems are extremely important for production-grade voice agents.
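This is also why we pre-process text before handing it to an external TTS. A toy version of the kind of number-reading guard we mean (spelling out each digit so the engine cannot misread a version number or figure) looks like:

```python
import re

# Toy guard against TTS misreading numbers: spell out each digit and the
# decimal point so synthesis is deterministic. A production version would
# handle currencies, dates, ordinals, etc.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    ".": "point",
}

def spell_out_numbers(text: str) -> str:
    """Replace digit/decimal runs with digit-by-digit words."""
    def repl(m: re.Match) -> str:
        return " ".join(DIGIT_WORDS[ch] for ch in m.group(0))
    return re.sub(r"\d+(?:\.\d+)?", repl, text)
```

With audio-only output there is no place to run a pass like this, which is exactly the controllability gap we are describing.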
That said, we are very impressed by the exceptionally fast response times and the high-level speech understanding of the Gemini Live API. If a text modality is available in Live mode, we believe we can fully leverage these strengths while pairing them with a highly reliable external TTS. This combination would enable us to build trustworthy, high-quality voice agents and unlock many valuable use cases.