Live API - PTT with external STT & Interruptions

The Main Issue

I need to implement push-to-talk (PTT) with interruption support using:

  • External ASR (separate Speech-to-Text model) for audio processing
  • Live API for text-based conversation with audio responses

The core problem is handling interruptions properly. Users need to be able to interrupt mid-response and immediately continue the conversation, but I’m facing WebSocket protocol violations.

Why External ASR / STT Instead of Full Live API with VAD?

I specifically want PTT with interruption over full Live API with VAD because of various reasons.

Proposed Architecture

User Audio → External ASR → Live API Text Request → Audio Response Stream
                                     ↑
                              INTERRUPTION PROBLEM

The Interruption Problem

The Live API documentation shows interruption patterns using manual activity detection with raw audio:

config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {"automatic_activity_detection": {"disabled": True}},
}

# Send raw audio with activity signals for interruption
await session.send_realtime_input(activity_start=types.ActivityStart())
await session.send_realtime_input(audio=audio_blob)
await session.send_realtime_input(activity_end=types.ActivityEnd())  # For interruption

But this requires sending raw audio to Live API, not text from external ASR.

My Failed Attempt

I tried implementing interruption support by sending explicit signals to the Live API session.

Normal conversation works fine: I send transcribed text from external ASR to Live API, it responds with streaming audio.

# This works perfectly
await session.send_client_content(
    turns=types.Content(role="user", parts=[types.Part(text=transcribed_text)]),
    turn_complete=True
)

For interruptions: When users interrupt mid-response, I attempted to signal this by sending just turn_complete=True, thinking it would tell the API to stop generating.

# For interruption, I tried this
await session.send_client_content(turn_complete=True)  # ❌ Causes WebSocket error 1007

Result: This immediately corrupts the session with 1007 (invalid frame payload data) error. The entire session becomes unusable and must be recreated.

The issue: Sending turn_complete=True without proper content violates the Live API’s message format, but I can’t find documentation on the correct interruption approach for text-based conversations.

Core Questions About Interruption

  1. Can I implement proper PTT interruption with external ASR + Live API text requests?

  2. How should interruptions be handled in text-based Live API conversations?

    • Can I send activityEnd signals even when using text requests?
    • Is there a proper way to signal turn completion for interruption?
    • Should I rely on natural interruption when sending new requests?
  3. Am I forced to use Live API’s built-in ASR for proper interruption support?

Environment

  • SDK: google-genai Python SDK v1.24.0
  • Model: gemini-2.5-flash-preview-native-audio-dialog
  • Use Case: Real-time conversational AI with interruption

Request for Guidance

The documentation seems to assume raw audio input for interruption handling. Is it possible to implement proper PTT interruption with external ASR + Live API text requests? If so, what’s the correct approach?

I specifically want PTT with interruption over full Live API with VAD due to my architecture requirements. Any insights would be greatly appreciated!


References: