I’m building a voice tutoring app using Gemini 3.1 Flash Live Preview over a raw WebSocket. Text input works: when I send a clientContent message, Gemini responds with audio. But when I send audio input via realtimeInput.audio, Gemini never responds. I only receive sessionResumptionUpdate messages, never a modelTurn or turnComplete.
Setup message:
```json
{
  "setup": {
    "model": "models/gemini-3.1-flash-live-preview",
    "generation_config": {
      "response_modalities": ["AUDIO"],
      "speech_config": {
        "voice_config": {
          "prebuilt_voice_config": { "voice_name": "Leda" }
        }
      }
    },
    "system_instruction": { "parts": [{ "text": "..." }] },
    "realtime_input_config": {
      "automatic_activity_detection": { "disabled": true }
    }
  }
}
```
Audio message:
```json
{
  "realtimeInput": {
    "audio": {
      "data": "<base64 PCM>",
      "mimeType": "audio/pcm;rate=16000"
    }
  }
}
```
Audio source: browser AudioContext at 16 kHz, with an AudioWorklet converting float32 samples to int16 PCM in chunks of 320 samples (20 ms each). I've confirmed the signal amplitude is above the noise floor.
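To make the audio path concrete, here is a minimal sketch of the conversion and message-building step described above. It is not the exact worklet code: the helper names (floatToInt16Pcm, toBase64, buildAudioMessage) are hypothetical, and it assumes little-endian 16-bit samples and the same realtimeInput.audio message shape shown earlier.

```typescript
// Hypothetical helper: convert float32 samples (range -1..1) to
// 16-bit signed little-endian PCM bytes.
function floatToInt16Pcm(samples: Float32Array): Uint8Array {
  const view = new DataView(new ArrayBuffer(samples.length * 2));
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true); // true = little-endian
  }
  return new Uint8Array(view.buffer);
}

// Hypothetical helper: base64-encode raw bytes using the global btoa.
function toBase64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Hypothetical helper: wrap one chunk in the realtimeInput.audio message
// from this post (assumed shape, copied from the message above).
function buildAudioMessage(samples: Float32Array, sampleRate = 16000): string {
  return JSON.stringify({
    realtimeInput: {
      audio: {
        data: toBase64(floatToInt16Pcm(samples)),
        mimeType: `audio/pcm;rate=${sampleRate}`,
      },
    },
  });
}
```

Each 320-sample chunk becomes 640 bytes of PCM, and the resulting string would be passed to ws.send(...) on the open WebSocket.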
Question: What is the correct message format (and message sequence, if more than one message is needed) for sending audio input that triggers a spoken response from Gemini 3.1 Flash Live over a raw WebSocket? Thank you!