I am building an AI interviewer.
I am connecting from Java over the WebSocket endpoint wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent
Model: gemini-2.5-flash-native-audio-preview-12-2025
The connection to the v1beta service works, and I send the setup config right after the socket opens.
My requirement: a candidate joins, and the interview flows as below.
- The candidate starts the interview, and the AI asks the first question (based on the skill given in the system prompt at setup time).
- The AI returns both audio and a transcript: audio (TTS) so the candidate can listen to the question, and the transcript for display in the UI.
- After hearing the question, the candidate speaks the answer; that audio goes back to Gemini, which converts it to a transcript (STT).
- Based on the answer, the AI either asks a follow-up question (if the answer is incomplete or needs more detail) or moves on to the next question (if the answer is satisfactory).
- I also need interruption (barge-in): if the AI is speaking and the candidate starts talking, the AI audio should stop immediately, and the AI should speak again only after the candidate has finished.
This is my flow, and I want to implement it with live streaming (the Gemini Live API).
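To make the setup concrete, here is a minimal sketch of how I open the socket and send the setup frame using the JDK's built-in `java.net.http` WebSocket client. The `GEMINI_API_KEY` environment variable and the `?key=` query parameter are my assumptions about auth; the rest mirrors my actual setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

// Sketch only: shows connecting to the Live API endpoint and sending the
// mandatory setup message as the first frame after the socket opens.
public class LiveApiClient {

    static final String ENDPOINT =
        "wss://generativelanguage.googleapis.com/ws/"
      + "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent";

    // Builds the first message sent after the socket opens (trimmed-down
    // version of my full setup payload).
    static String buildSetupJson() {
        return "{\"setup\":{"
             + "\"model\":\"models/gemini-2.5-flash-native-audio-preview-12-2025\","
             + "\"generation_config\":{\"temperature\":0.7,"
             + "\"response_modalities\":[\"AUDIO\"]}"
             + "}}";
    }

    public static void main(String[] args) {
        // Assumption: the API key is passed as a query parameter.
        String apiKey = System.getenv("GEMINI_API_KEY");
        if (apiKey == null) {
            // No key available: just show the setup frame we would send.
            System.out.println(buildSetupJson());
            return;
        }
        HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create(ENDPOINT + "?key=" + apiKey), new WebSocket.Listener() {
                @Override
                public void onOpen(WebSocket webSocket) {
                    // The setup message must be the first frame on the socket.
                    webSocket.sendText(buildSetupJson(), true);
                    WebSocket.Listener.super.onOpen(webSocket);
                }
                @Override
                public CompletionStage<?> onText(WebSocket webSocket,
                                                 CharSequence data, boolean last) {
                    System.out.println("server: " + data);
                    return WebSocket.Listener.super.onText(webSocket, data, last);
                }
            })
            .join();
    }
}
```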
I have set up the following:
{
  "setup": {
    "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
    "generation_config": {
      "temperature": 0.7,
      "response_modalities": ["AUDIO"]
    },
    "realtimeInputConfig": {
      "activityHandling": "START_OF_ACTIVITY_INTERRUPTS",
      "turnCoverage": "TURN_INCLUDES_ALL_INPUT",
      "automaticActivityDetection": {
        "disabled": false,
        "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
        "startOfSpeechSensitivity": "START_SENSITIVITY_HIGH",
        "silence_duration_ms": 700
      }
    },
    "input_audio_transcription": {},
    "output_audio_transcription": {},
    "system_instruction": {
      "parts": [{ "text": "" }]
    }
  }
}
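On the receive side, each server message has to be routed: input/output transcripts go to the UI, audio chunks go to playback, and an `interrupted` signal means playback must be flushed for barge-in. Below is a deliberately simple stdlib-only sketch of that routing; a production version would parse the JSON with Jackson or Gson instead of substring checks, and the field names (`interrupted`, `inputTranscription`, `outputTranscription`, `inlineData`) are what I observe in the v1beta messages.

```java
// Sketch: routes Live API server messages by substring matching.
// Real code should parse JSON properly (Jackson/Gson); this only
// illustrates which message kinds the client must handle.
public class ServerMessageRouter {

    public enum Kind { INPUT_TRANSCRIPT, OUTPUT_TRANSCRIPT, INTERRUPTED, AUDIO, OTHER }

    public static Kind classify(String json) {
        if (json.contains("\"interrupted\""))
            return Kind.INTERRUPTED;        // barge-in: stop/flush audio playback now
        if (json.contains("\"inputTranscription\""))
            return Kind.INPUT_TRANSCRIPT;   // candidate's speech transcribed (STT)
        if (json.contains("\"outputTranscription\""))
            return Kind.OUTPUT_TRANSCRIPT;  // AI's spoken question as text for the UI
        if (json.contains("\"inlineData\""))
            return Kind.AUDIO;              // base64 PCM chunk to enqueue for playback
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("{\"serverContent\":{\"interrupted\":true}}"));
    }
}
```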
When the candidate speaks (from the browser), the transcript comes back from the Gemini server in whatever language it detects (Hindi, English, etc.). I need both the input transcript and the output transcript to be in "en-IN".
I have tried a speech config as well:
"speech_config": {
  "voice_config": {
    "prebuilt_voice_config": {
      "voice_name": "Puck"
    }
  }
},
I have also tried adding a language code to the transcription config: "input_audio_transcription": {"languageCode": "en-IN"}, but the transcripts still come back in other languages.
Why does this happen, and how do I fix it?
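For reference, one variant I am considering is moving the language into speech_config instead of the transcription blocks. I am not sure whether language_code is honored for the native-audio model, so this is a guess rather than something I have confirmed:

```json
"speech_config": {
  "language_code": "en-IN",
  "voice_config": {
    "prebuilt_voice_config": { "voice_name": "Puck" }
  }
}
```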

