Here are the details regarding the issue. I am using the Gen AI SDK (JavaScript/React) with the gemini-2.0-flash-exp model via the Multimodal Live API (WebSocket).
The Issue:
Even though I am speaking clearly in English, the model frequently transcribes the input in other languages (e.g., Hindi or Welsh) or produces unrelated characters, and sometimes responds in those languages. This often happens when there is silence or slight background noise.
I attempted to force the input language by setting `model: "en-US"` inside `inputAudioTranscription`, but the API throws a validation error (see below).
Code Snippet:
Here is the configuration I am passing to client.live.connect.
```javascript
sessionRef.current = await aiClientRef.current.live.connect({
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  config: {
    responseModalities: ["AUDIO"], // Using Modality.AUDIO
    systemInstruction: {
      parts: [{
        text: "You are an interviewer. You must listen and respond in English."
      }]
    },
    // The issue occurs regardless of tools, but here is the setup:
    tools: [{
      functionDeclarations: [{
        name: 'end_interview',
        description: 'Ends the interview session.',
        parameters: {
          type: 'object',
          properties: { reason: { type: 'string' } },
          required: ['reason']
        }
      }]
    }],
    // ATTEMPTED FIX:
    // When I leave this empty {}, it auto-detects (poorly).
    // When I try to set { model: "en-US" }, it crashes.
    inputAudioTranscription: {
      // model: "en-US" // <-- This causes the "Invalid JSON payload" error
    },
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: {
          voiceName: "Despina",
        },
      },
    },
    realtimeInputConfig: {
      automaticActivityDetection: {
        disabled: false,
        startOfSpeechSensitivity: "START_SENSITIVITY_LOW",
        endOfSpeechSensitivity: "END_SENSITIVITY_LOW",
        prefixPaddingMs: 20,
        silenceDurationMs: 3000,
      },
    },
  }
});
```
The Error:
When I try to set `model` in `inputAudioTranscription` to fix the detection issue, I receive:

```
Invalid JSON payload received. Unknown name "model" at 'setup.input_audio_transcription': Cannot find field.
```
Steps to Reproduce:
1. Connect to the Live API using the config above.
2. Stream audio chunks from the browser microphone (I am using `Int16Array` PCM).
3. Speak a short English phrase or leave a moment of silence.
4. Observe the `serverContent` transcription events; they often switch to random languages instead of staying in English.
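For context on step 2, the PCM chunks are produced roughly like this. This is a minimal sketch of the Float32-to-Int16 conversion only (my actual capture pipeline is longer and may differ in details):

```javascript
// Sketch: convert Web Audio Float32 samples (range -1..1) into the
// 16-bit PCM (Int16Array) chunks that are streamed to the Live API.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp before scaling so out-of-range samples cannot wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}
```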
Is there a supported parameter that strictly enforces the input language for the Live API and prevents these transcription hallucinations?
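For completeness, the only language-related field I have found in the SDK's types is `speechConfig.languageCode`, and as far as I can tell it targets the spoken output rather than input transcription, so I am not sure it is the intended fix. Treat the fragment below as an untested assumption:

```javascript
// Untested assumption: SpeechConfig in the JS Gen AI SDK appears to
// accept a languageCode, but it seems to govern the spoken *output*,
// not how the incoming audio is transcribed.
speechConfig: {
  languageCode: "en-US",
  voiceConfig: {
    prebuiltVoiceConfig: { voiceName: "Despina" },
  },
},
```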
Thanks!