WebSocket Error 1007 When Requesting Simultaneous Audio + Text in Gemini Flash Models (AI Voice Transcription Issue)

Hello everyone,

I’m documenting a limitation, and the resulting “Invalid Argument” error (WebSocket error code 1007), that I encountered while building real-time voice applications that need both audio output and text transcription from Gemini Flash models.
This post summarizes the current behavior across model versions and explains why requesting AUDIO and TEXT output together fails on each model.

Issue Summary
In real-time voice or speech-to-speech scenarios, a common configuration attempt is:
response_modalities: ["AUDIO", "TEXT"]
At present, this configuration fails across Gemini Flash models, though for different underlying reasons.

Model Behavior Breakdown
Gemini 3.0 Flash / Gemini 3.0 Pro
Status: Audio output not supported
Capabilities:
• Accepts multimodal input (Text, Audio, Video)
• Generates text-only output
Observed Behavior:
• Including “AUDIO” in response_modalities causes the request to be rejected at configuration time
• Results in an immediate WebSocket 1007 (Invalid Argument) error
Implication:
• Native speech-to-speech is not possible
• Audio input can be used, but output must remain text-only
Recommended Use:
• High-speed reasoning
• Coding and structured text generation
• Text-based conversational systems
Voice output requires an external Text-to-Speech (TTS) service.
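
The text-then-TTS arrangement can be sketched as a small pipeline. This is only an illustration under assumptions: `respond_with_voice`, `generate_text`, and `synthesize_speech` are hypothetical names, and the two callables stand in for whatever model client and TTS service you actually use; no real API is called here.

```python
from typing import Callable, Tuple


def respond_with_voice(
    prompt: str,
    generate_text: Callable[[str], str],
    synthesize_speech: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Get a text reply from the model, then hand it to a separate TTS step."""
    # Step 1: text-only generation (the only output the model is reported to support).
    text = generate_text(prompt)
    # Step 2: external speech synthesis on the finished text.
    audio = synthesize_speech(text)
    return text, audio


if __name__ == "__main__":
    # Stand-in implementations so the sketch runs without any network access.
    def fake_generate_text(prompt: str) -> str:
        return f"Echo: {prompt}"

    def fake_synthesize_speech(text: str) -> bytes:
        return text.encode("utf-8")

    reply, audio_bytes = respond_with_voice(
        "hello", fake_generate_text, fake_synthesize_speech
    )
    print(reply, len(audio_bytes))
```

Because the text exists before the audio is synthesized, this arrangement yields both outputs, at the cost of extra latency from the second service.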

Gemini 2.5 Flash (Native Audio)
Status: Supports native audio output with a single-modality restriction
Key Limitation:
• The model can generate either AUDIO or TEXT, but not both in the same response
Observed Error:
• Requesting response_modalities: ["AUDIO", "TEXT"] results in WebSocket error 1007 (Invalid Argument)
Reason:
• The model produces a single output modality per response and cannot split generation across separate audio and text streams
Current Workaround:
• Request AUDIO-only output
• Retrieve text via the transcription side-channel
This allows audio generation with text available asynchronously, but not as a first-class dual output.
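
The workaround can be sketched with plain dictionaries, so it runs without any SDK or network connection. Treat the details as assumptions to verify against the current Live API reference: the JSON field names (`responseModalities`, `outputAudioTranscription`, `serverContent.outputTranscription.text`) reflect my understanding of the WebSocket message format, the model name is a placeholder, and the helper names are made up for this example.

```python
import json
from typing import Iterable


def build_setup_message(model: str) -> str:
    """Build the first WebSocket message: AUDIO-only output plus transcription."""
    setup = {
        "setup": {
            "model": model,
            # Exactly one modality, to avoid the 1007 rejection described above.
            "generationConfig": {"responseModalities": ["AUDIO"]},
            # Assumed field: an empty object asks for a transcript of the audio output.
            "outputAudioTranscription": {},
        }
    }
    return json.dumps(setup)


def collect_transcript(raw_messages: Iterable[str]) -> str:
    """Join the transcript fragments found in a sequence of server messages."""
    parts = []
    for raw in raw_messages:
        message = json.loads(raw)
        content = message.get("serverContent") or {}
        text = (content.get("outputTranscription") or {}).get("text")
        if text:
            parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    print(build_setup_message("models/PLACEHOLDER_NATIVE_AUDIO_MODEL"))
    # Hand-written sample messages, not captured from a real session.
    sample = [
        json.dumps({"serverContent": {"outputTranscription": {"text": "Hello "}}}),
        json.dumps({"serverContent": {"modelTurn": {"parts": []}}}),
        json.dumps({"serverContent": {"outputTranscription": {"text": "there"}}}),
    ]
    print(collect_transcript(sample))
```

Transcript fragments arrive in separate messages from the audio chunks, so the text has to be accumulated on the client and is not guaranteed to line up with audio playback.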

Summary Table
| Model | Audio Output | Text Output | Audio + Text Simultaneously |
| --- | --- | --- | --- |
| Gemini 3.0 Flash / Pro | No | Yes | No |
| Gemini 2.5 Flash (Native Audio) | Yes | Yes | No (via response_modalities) |
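
One way to avoid hitting the 1007 error at connection time is to encode the table as data and check the requested modalities on the client before opening the socket. This is a sketch under assumptions: the dictionary keys are informal labels for the rows above, not real API model IDs, and `check_modalities` is a hypothetical helper, not part of any SDK.

```python
# Capabilities as reported in the summary table; keys are informal labels only.
CAPABILITIES = {
    "gemini-3.0-flash-or-pro": {"AUDIO": False, "TEXT": True},
    "gemini-2.5-flash-native-audio": {"AUDIO": True, "TEXT": True},
}


def check_modalities(model_label, modalities):
    """Return the normalized modality list, or raise ValueError if the
    combination is one the table marks as unsupported."""
    requested = [m.upper() for m in modalities]
    if len(requested) != 1:
        raise ValueError("request exactly one response modality per session")
    modality = requested[0]
    supported = CAPABILITIES[model_label]
    if modality not in supported:
        raise ValueError(f"unknown modality: {modality}")
    if not supported[modality]:
        raise ValueError(f"{model_label} does not support {modality} output")
    return requested


if __name__ == "__main__":
    print(check_modalities("gemini-2.5-flash-native-audio", ["audio"]))
    try:
        check_modalities("gemini-2.5-flash-native-audio", ["AUDIO", "TEXT"])
    except ValueError as err:
        print("rejected:", err)
```

Failing early with a descriptive message is easier to debug than a generic "Invalid Argument" close frame from the server.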

Question for the Team
Is there a roadmap or planned support for true multi-modal generation, where audio and text are produced simultaneously without relying on transcription side-channels?
This capability would be essential for:
• Real-time voice assistants
• Live captioning systems
• Conversational agents with synchronized speech and text
Any clarification or guidance would be greatly appreciated.

Thank you.
Mehedi
