The Main Issue
I need to implement push-to-talk (PTT) with interruption support using:
- External ASR (separate Speech-to-Text model) for audio processing
- Live API for text-based conversation with audio responses
The core problem is handling interruptions properly. Users need to be able to interrupt mid-response and immediately continue the conversation, but I’m facing WebSocket protocol violations.
Why External ASR / STT Instead of Full Live API with VAD?
I specifically want PTT with interruption, rather than the full Live API with VAD, for reasons tied to my architecture.
Proposed Architecture
User Audio → External ASR → Live API Text Request → Audio Response Stream
↑
INTERRUPTION PROBLEM
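To make the interruption point concrete, here is a minimal sketch of the client-side states this architecture implies. The names `PttState` and `PttController` are hypothetical and not part of any SDK; the only assumption is that a PTT press arriving while a response is still streaming counts as an interruption.

```python
from enum import Enum, auto


class PttState(Enum):
    IDLE = auto()        # nothing in flight
    LISTENING = auto()   # PTT held, audio is going to the external ASR
    RESPONDING = auto()  # Live API audio response is streaming or playing


class PttController:
    """Tracks the PTT state and flags presses that count as interruptions."""

    def __init__(self) -> None:
        self.state = PttState.IDLE
        self.interruptions = 0

    def press(self) -> bool:
        """PTT pressed. Returns True when this press interrupts a response."""
        interrupted = self.state is PttState.RESPONDING
        if interrupted:
            self.interruptions += 1
        self.state = PttState.LISTENING
        return interrupted

    def release(self) -> None:
        """PTT released: the transcript gets sent, so a response is expected."""
        if self.state is PttState.LISTENING:
            self.state = PttState.RESPONDING

    def response_finished(self) -> None:
        """The response stream ended without being interrupted."""
        if self.state is PttState.RESPONDING:
            self.state = PttState.IDLE
```

The `RESPONDING` to `LISTENING` transition is the one in question: what, if anything, should be sent to the Live API at that moment.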
The Interruption Problem
The Live API documentation shows interruption patterns using manual activity detection with raw audio:
from google.genai import types

config = {
    "response_modalities": ["AUDIO"],
    "realtime_input_config": {"automatic_activity_detection": {"disabled": True}},
}
# Send raw audio with activity signals for interruption
# (session is an already-open Live API session)
await session.send_realtime_input(activity_start=types.ActivityStart())
await session.send_realtime_input(audio=audio_blob)
await session.send_realtime_input(activity_end=types.ActivityEnd())  # For interruption
But this requires sending raw audio to Live API, not text from external ASR.
My Failed Attempt
I tried implementing interruption support by sending explicit signals to the Live API session.
Normal conversation works fine: I send transcribed text from external ASR to Live API, it responds with streaming audio.
# This works perfectly
await session.send_client_content(
    turns=types.Content(role="user", parts=[types.Part(text=transcribed_text)]),
    turn_complete=True,
)
For interruptions: When users interrupt mid-response, I attempted to signal this by sending just turn_complete=True, thinking it would tell the API to stop generating.
# For interruption, I tried this
await session.send_client_content(turn_complete=True) # ❌ Causes WebSocket error 1007
Result: This immediately breaks the session with WebSocket close code 1007 (invalid frame payload data). The entire session becomes unusable and must be recreated.
The issue: Sending turn_complete=True without proper content violates the Live API’s message format, but I can’t find documentation on the correct interruption approach for text-based conversations.
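For comparison, here is a rough sketch of a purely client-side workaround I am considering. It never sends a content-less turn, silences the current response locally when PTT is pressed, and drops any audio that belongs to an abandoned turn. `InterruptibleTurns` and `play_chunk` are hypothetical names, and `session` is assumed to be any object exposing `send_client_content(turns=..., turn_complete=...)` as in the working call above. Whether the server actually stops generating the old response when the next text turn arrives is an assumption I have not been able to confirm.

```python
import asyncio
from typing import Any, Callable


class InterruptibleTurns:
    """Hypothetical helper: tags each turn with an id so stale audio is dropped."""

    def __init__(self, session: Any, play_chunk: Callable[[bytes], None]) -> None:
        self._session = session        # assumed to expose send_client_content()
        self._play_chunk = play_chunk  # hypothetical audio output callback
        self._turn_id = 0              # id of the turn whose audio may be played
        self._muted = False            # True between an interruption and the next turn

    def interrupt(self) -> None:
        """Called on PTT press: mute the current response locally, send nothing."""
        self._muted = True

    async def send_text(self, turns: Any) -> int:
        """Send a real user turn; refuses the content-less call that caused 1007."""
        if turns is None:
            raise ValueError("refusing to send a turn without content")
        self._turn_id += 1
        self._muted = False
        await self._session.send_client_content(turns=turns, turn_complete=True)
        return self._turn_id

    def on_audio(self, turn_id: int, chunk: bytes) -> bool:
        """Play a chunk only if it belongs to the current, un-interrupted turn."""
        if self._muted or turn_id != self._turn_id:
            return False
        self._play_chunk(chunk)
        return True
```

In real use, `turns` would be the `types.Content(...)` object built from the ASR transcript, and the receive loop started for a turn would pass that turn's id to `on_audio`. This only hides the old audio on the client; it does not tell the API to stop generating, which is exactly the part I am unsure about.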
Core Questions About Interruption
- Can I implement proper PTT interruption with external ASR + Live API text requests?
- How should interruptions be handled in text-based Live API conversations?
  - Can I send activityEnd signals even when using text requests?
  - Is there a proper way to signal turn completion for interruption?
  - Should I rely on natural interruption when sending new requests?
- Am I forced to use Live API’s built-in ASR for proper interruption support?
Environment
- SDK: google-genai Python SDK v1.24.0
- Model: gemini-2.5-flash-preview-native-audio-dialog
- Use Case: Real-time conversational AI with interruption
Request for Guidance
The documentation seems to assume raw audio input for interruption handling. Is it possible to implement proper PTT interruption with external ASR + Live API text requests? If so, what’s the correct approach?
I specifically want PTT with interruption over full Live API with VAD due to my architecture requirements. Any insights would be greatly appreciated!