Live Text Modality for STT

So… I’m using the 2.0 exp model because it’s the only current way to use the Live API and run STT plus a separate real-time TTS solution. I just got notified that 2.0 is going away, which locks you into the Google pre-trained voices and audio-only output. Does anyone have a solution that I’m not seeing here? Using Cartesia.ai along with the 2.0 Gemini Live exp model has been a phenomenal experience…

Hello Seth! We are not sunsetting the Flash 2.0 models for a few months. We will provide an alternative solution for using the Live API with STT and a separate real-time TTS.

Could you please let me know your full set of requirements? Thank you

Hi Alisa,

Thank you for reaching out! Here’s our full set of requirements for the Gemini Live API:

Our Architecture: TEXT Modality + External TTS

We use what we call a “half-cascade” architecture:

1. Gemini Live API handles real-time STT and conversation (TEXT modality only)

2. External TTS (Cartesia, or our custom CosyVoice-based system) handles speech synthesis with our persona voices

This architecture is critical for us because:

  • We have a marketplace of unique AI persona voices that require custom TTS (voice cloning + 21 emotion controls)

  • Gemini’s native audio output doesn’t support the voice customization we need

  • We need fine-grained prosody control (speed, pitch, pauses, emotional arcs)
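To make the half-cascade concrete, here is a minimal sketch of the flow. All classes are illustrative stubs, not the actual Live API or Cartesia clients; the real system streams audio over LiveKit.

```python
# Half-cascade sketch: Live API text out -> director layer -> external TTS.
# Every class here is a stand-in, not a real SDK object.

class LiveSession:
    """Stand-in for a Gemini Live session configured with TEXT-only output."""
    def stream_text(self, user_audio: bytes):
        # In the real system this yields low-latency text chunks from the Live API.
        yield "I understand. "
        yield "Tell me more about that."

class DirectorLayer:
    """Orchestrates everything between understanding (STT) and speaking (TTS)."""
    def process(self, text: str) -> str:
        text = text.strip()
        # Humanization, SSML injection, sanitization, etc. would run here.
        return f"<speak>{text}</speak>"

class ExternalTTS:
    """Stand-in for Cartesia / CosyVoice synthesis with persona voices."""
    def synthesize(self, ssml: str) -> bytes:
        return ssml.encode("utf-8")  # placeholder for real audio bytes

def handle_turn(session: LiveSession, director: DirectorLayer,
                tts: ExternalTTS, user_audio: bytes) -> list[bytes]:
    """One conversational turn: each text chunk passes through the director."""
    audio_chunks = []
    for chunk in session.stream_text(user_audio):
        ssml = director.process(chunk)  # only possible with TEXT modality
        audio_chunks.append(tts.synthesize(ssml))
    return audio_chunks
```

The point of the sketch is the middle hop: if the model emitted audio directly, `DirectorLayer.process` would have nowhere to run.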

Why Separation Matters: The “Director” Layer

This is the most important part of our architecture.

Our codebase acts as a director in a play - orchestrating what happens between understanding (STT) and speaking (TTS). Combining STT and TTS into one black box would eliminate the intelligence layer that makes our AI feel human.

Here’s what happens in our “director” layer:

| Capability | What It Does | Why It Needs TEXT |
|---|---|---|
| Humanization Engine | Adds natural speech patterns, hesitations, self-corrections | Modifies text before TTS |
| Emotional SSML Injection | Annotates text with SSML emotion markup based on context | Needs text to annotate |
| Memory Prosody | Adds pauses when referencing past conversations ("Your dad’s passing...") | Analyzes text for memory triggers |
| Meaningful Silence | Decides when NOT to speak (grief, processing) | Intercepts before TTS |
| Tool Execution | Runs functions mid-conversation, injects results | Must happen between LLM and TTS |
| Persona Switching | Hands off to different AI team members mid-session | Changes voice/personality |
| Backchanneling | Injects "mm-hmm", "I see" at natural moments | Timing-aware text injection |
| Circadian Adaptation | Adjusts speed/warmth based on time of day | Modifies prosody parameters |
| Relationship Stage | Voice evolves from formal → intimate over months | Long-term state affects TTS |
| Context Builders | 45+ modules inject guidance (emotion, memory, calendar, etc.) | Shape LLM output before TTS |
| Anticipation Pipeline | Prepares response prosody during user speech | Pre-computes TTS parameters |
| Sanitization | Strips leaked JSON, fixes pronunciation, cleans SSML | Must process text before TTS |

Our product IS the director layer. The LLM provides understanding; our code provides humanity.
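As an illustration of why these stages need a text intermediary, here is a toy version of two of them, sanitization (stripping leaked JSON and stray SSML) followed by emotion annotation, applied in order before anything reaches TTS. The tag format and regexes are invented for the sketch, not our production rules:

```python
import re

def sanitize(text: str) -> str:
    """Strip leaked JSON fragments and stray SSML before synthesis (toy rules)."""
    text = re.sub(r"\{[^{}]*\}", "", text)   # drop leaked inline JSON objects
    text = re.sub(r"</?speak>", "", text)     # drop stray SSML wrappers
    return re.sub(r"\s+", " ", text).strip()  # collapse leftover whitespace

def inject_emotion(text: str, mood: str) -> str:
    """Wrap the utterance in a hypothetical emotion tag the TTS layer understands."""
    return f"[{mood}] {text}"

def director_pass(raw_llm_text: str, mood: str) -> str:
    """Run the (abbreviated) director pipeline: sanitize first, then annotate."""
    return inject_emotion(sanitize(raw_llm_text), mood)
```

Both steps read and rewrite text; with an audio-only model there is no point in the pipeline where either could run.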

Specific Features We Rely On

| Feature | Our Usage |
|---|---|
| TEXT modality | All responses come as text, routed through our director layer to TTS |
| inputAudioTranscription | We need UserInputTranscribed events for our tool routing system |
| Built-in VAD/Turn Detection | Fast server-side turn detection (~100-200ms) |
| Streaming text output | Low-latency text chunks for our processing pipeline |
| Native function calling | Backup layer for tool execution |

What We Need in Any Alternative Solution

1. TEXT-only output mode - We cannot use combined audio output; we need text to run through our director layer

2. Input audio transcription events - Real-time transcripts during user speech

3. Server-side turn detection - Fast VAD built into the API

4. Streaming support - Low latency is critical for natural conversation

5. Function calling - Native or text-based
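In client terms, the five requirements above reduce to an event loop roughly like the following. The event names are hypothetical stand-ins, loosely modeled on current Live API server events, not an actual SDK:

```python
from dataclasses import dataclass

# Hypothetical event types standing in for Live API server events.
@dataclass
class InputTranscription:   # real-time transcript of user speech
    text: str

@dataclass
class TextChunk:            # streaming TEXT-modality model output
    text: str

@dataclass
class TurnComplete:         # server-side VAD/turn detection fired
    pass

def run_turn(events):
    """Collect transcripts for tool routing and text chunks for the TTS pipeline."""
    transcript_parts, response_parts = [], []
    for event in events:
        if isinstance(event, InputTranscription):
            transcript_parts.append(event.text)  # feeds our tool routing system
        elif isinstance(event, TextChunk):
            response_parts.append(event.text)    # feeds director layer -> TTS
        elif isinstance(event, TurnComplete):
            break                                # fast server-side turn boundary
    return " ".join(transcript_parts), "".join(response_parts)
```

Any alternative API that delivers these three event streams, however they are spelled, would slot into our architecture.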

Feature Requests: Making Gemini’s Native Audio Competitive

We’ve evaluated alternatives like FlashLabs Chroma (open-source real-time dialogue with voice cloning) and Cartesia.ai. Here’s what would make Gemini’s native audio viable for production voice AI:

| Feature | FlashLabs Chroma | Cartesia | What We Need from Gemini |
|---|---|---|---|
| Voice cloning | ✅ Reference audio prompts | ✅ 10-20 sec upload | ✅ Short reference audio → instant voice |
| Emotion control | ❌ Not documented | ✅ 21 emotions | ✅ Tag-based or instruction-based emotions |
| SSML/Prosody | ❌ Not documented | ✅ Speed, pitch, breaks | ✅ Accept SSML in audio generation |
| Streaming latency | Real-time (unspecified) | ~200-300ms first byte | ≤200ms first audio byte |
| TEXT intermediary | ❌ End-to-end only | N/A (TTS only) | ✅ Critical: TEXT output for our director layer |

The ideal Gemini Live API would offer:

1. TEXT modality (keep current!) - For our director layer processing

2. Voice cloning - Upload 10-20 sec reference → custom voice ID

3. Emotion tags in audio output - [happy], [sympathetic], [calm] applied to native audio

4. SSML support in audio - Our director layer outputs SSML; native audio should respect it

5. Hybrid mode - TEXT + Audio output simultaneously (we process text, user hears audio)
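For concreteness, here is a sketch of the kind of markup our director layer emits and that native audio output would need to honor. The `voice` and `emotion` attributes follow a Cartesia-like style and are purely illustrative; this is not a published Gemini schema:

```python
def build_ssml(text: str, voice_id: str, emotion: str, pause_ms: int = 0) -> str:
    """Compose prosody-controlled markup (illustrative schema, not a real API)."""
    # A leading break is how "memory prosody" adds a beat before heavy topics.
    lead_pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return (
        f'<speak voice="{voice_id}" emotion="{emotion}">'
        f"{lead_pause}{text}</speak>"
    )
```

For example, `build_ssml("I'm so sorry about your dad.", "persona-42", "sympathetic", pause_ms=600)` yields an utterance with a custom voice, an emotion hint, and a 600ms pause before speaking, exactly the three controls the wishlist above asks for.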

Why this matters:

The open-source community (FlashLabs Chroma, CosyVoice, Kokoro) is rapidly adding voice cloning and personalization. Cartesia has become the de facto choice for production voice AI because of emotion control + voice cloning. If Gemini’s native audio supported these features, it could become a true end-to-end solution - but only if TEXT output remains available for applications like ours that need the “director” layer.

What Would NOT Work for Us

  • Any model that ONLY supports audio-to-audio (no TEXT intermediary)

  • Any model that doesn’t support TEXT modality with audio input

  • Losing the inputAudioTranscription capability

  • Pre-trained voices only (we need custom persona voices)

Our Current Stack

  • Model: gemini-2.0-flash-exp

  • Client: @livekit/agents-plugin-google

  • Director Layer: millions of lines of code supporting humanization, context builders, and SSML processing

  • TTS: Cartesia.ai (with superhuman capabilities layer)

  • Transport: LiveKit (WebRTC)

We’re happy to provide more technical details, share code examples, or hop on a call to discuss. We’ve been very happy with the current 2.0 exp model and want to ensure continuity.

Best,

Seth

I am seeing that models/gemini-2.0-flash-exp has been removed and is no longer working as of today. Can you confirm?

It is showing up in Vertex but not in AI Studio, and the Live API does not work in Vertex. Please restore the model to AI Studio; you all pulled it last night!!!

Correction: gemini-2.0-flash-live-preview-04-09 is now the only model that supports STT, it is only available in Vertex, and it feels a lot slower than the exp model. Can we please get a version of the exp model, even if it sucks at tool calling and doesn’t have audio integration? And can we get it in AI Studio rather than Vertex… If you also provide a real-time TTS model that supports SSML, that would be amazing! @Alisa_Fortin let me know who I need to talk to…

Hello Alisa,

We also have a strong requirement for a model that can use the Live API with text output, so that we can continue to rely on external TTS solutions. This point is critical for us.

At the moment, the audio output from the Gemini Live API still has reliability issues, such as frequent misreading of numbers and similar errors. Because of this, the quality and controllability provided by external TTS systems are extremely important for production-grade voice agents.

That said, we are very impressed by the exceptionally fast response times and the high-level speech understanding of the Gemini Live API. If a text modality is available in Live mode, we believe we can fully leverage these strengths while pairing them with a highly reliable external TTS. This combination would enable us to build trustworthy, high-quality voice agents and unlock many valuable use cases.