Gemini Live API gemini-2.5-flash-native-audio-preview-12-2025

I am building an AI interview application.
I am using the WebSocket endpoint wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent
Model: gemini-2.5-flash-native-audio-preview-12-2025

I am using Java; over this WebSocket endpoint I have connected to the Gemini v1beta API and sent the setup config.
My requirement: a candidate joins and gives the interview, with the flow below.

  1. The candidate starts the interview, then the AI asks the first question (related to the skill given as the system prompt during setup).
  2. The AI returns audio and a transcript (TTS): audio for listening to the question, transcript for display in the UI.
  3. After listening to the question, the candidate speaks the answer; the audio then goes to Gemini, which converts it to a transcript (STT).
  4. Based on the given answer, the AI asks a follow-up question if the answer is not proper or more details are needed, or asks the next question if the answer is satisfactory.
  5. I also want interruption: if the AI is speaking and the candidate starts speaking, the AI audio should stop immediately; after the candidate finishes speaking, the AI speaks again.

That is my flow, and I want to implement it with live streaming (the Gemini Live API).
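For reference, this is roughly how I wrap each audio chunk from the browser before sending it over the socket (a sketch; the class and method names are mine, and I'm assuming 16 kHz 16-bit PCM input, which is what the Live API expects for realtime input):

```java
import java.util.Base64;

// Sketch: wrap a raw PCM chunk (assumed 16 kHz, 16-bit mono) into the
// realtimeInput frame sent to the BidiGenerateContent WebSocket.
public class RealtimeAudioFrame {
    static String buildAudioMessage(byte[] pcmChunk) {
        String b64 = Base64.getEncoder().encodeToString(pcmChunk);
        return "{\"realtimeInput\":{\"audio\":{"
             + "\"mimeType\":\"audio/pcm;rate=16000\","
             + "\"data\":\"" + b64 + "\"}}}";
    }
}
```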

I have set up the config below:
{
  "setup": {
    "model": "models/gemini-2.5-flash-native-audio-preview-12-2025",
    "generation_config": {
      "temperature": 0.7,
      "response_modalities": ["AUDIO"]
    },
    "realtimeInputConfig": {
      "activityHandling": "START_OF_ACTIVITY_INTERRUPTS",
      "turnCoverage": "TURN_INCLUDES_ALL_INPUT",
      "automaticActivityDetection": {
        "disabled": false,
        "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
        "startOfSpeechSensitivity": "START_SENSITIVITY_HIGH",
        "silence_duration_ms": 700
      }
    },
    "input_audio_transcription": {},
    "output_audio_transcription": {},
    "system_instruction": { "parts": [{ "text": "" }] }
  }
}
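This is the Java side that opens the socket and sends that setup frame as the first message (a trimmed sketch using java.net.http.WebSocket; the listener just logs, buildSetup() shows only a minimal config, and auth is shown as a query-string key for brevity):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletionStage;

// Sketch of my connection code: open the BidiGenerateContent WebSocket,
// send the setup frame first, then stream audio frames afterwards.
public class LiveSession {
    static final String ENDPOINT =
        "wss://generativelanguage.googleapis.com/ws/"
        + "google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent";

    // Minimal setup frame (same shape as the full config above).
    static String buildSetup(String model) {
        return "{\"setup\":{\"model\":\"models/" + model + "\","
             + "\"generation_config\":{\"response_modalities\":[\"AUDIO\"]}}}";
    }

    static WebSocket connect(String apiKey, String model) {
        WebSocket ws = HttpClient.newHttpClient()
            .newWebSocketBuilder()
            .buildAsync(URI.create(ENDPOINT + "?key=" + apiKey),
                        new WebSocket.Listener() {
                            @Override
                            public CompletionStage<?> onText(WebSocket w,
                                    CharSequence data, boolean last) {
                                // setupComplete, transcripts, audio, etc.
                                System.out.println("<< " + data);
                                w.request(1);
                                return null;
                            }
                        })
            .join();
        ws.sendText(buildSetup(model), true); // setup must be the first frame
        return ws;
    }
}
```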

When the candidate speaks (from the browser), the transcript received from the Gemini server can be in any language (Hindi, English, etc.). I need to set "en-IN" for both the output transcript and the input transcript.
I have tried the speech config as well:
"speech_config": {
  "voice_config": {
    "prebuilt_voice_config": {
      "voice_name": "Puck"
    }
  }
},
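One thing I found in the API reference and plan to try: SpeechConfig also documents a language_code field, so the block could become the following (though the docs note that native-audio models choose the language automatically, so this field may be ignored by this model):

```json
"speech_config": {
  "language_code": "en-IN",
  "voice_config": {
    "prebuilt_voice_config": { "voice_name": "Puck" }
  }
},
```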

I also set a language code in "input_audio_transcription": {"languageCode:" "en-IN"}
but it is still not working at all.

Why does this happen, and how do I fix it?

I asked Gemini, and this is what it said:

  • Syntax & Formatting: The "languageCode" key in your JSON had a colon inside the quotes, causing the configuration to be ignored. Ensure keys are strictly formatted as "key": "value".

  • Missing Transcripts: By setting response_modalities to only ["AUDIO"], you blocked the model from sending the text you need for your UI. Adding "TEXT" to the modalities allows for simultaneous audio and transcript delivery.

  • Language Enforcement: The transcription config dictates how the AI hears, but it doesn’t force how it speaks. You must use the System Instruction to mandate that the AI responds in English with “en-IN” nuances.

  • Passive Start: The API waits for user input by default. To make the AI ask the first question, your Java client must send an initial “trigger” message (e.g., “Start interview”) immediately after the setup is complete.

  • Interruption Handling: While the server-side interruption is enabled via START_OF_ACTIVITY_INTERRUPTS, your client-side application must actively clear its own audio playback buffer the moment it detects the candidate has started speaking.

Does this help?
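For what it's worth, the "trigger" from point 4 would look like this on my side: a clientContent text turn sent right after setupComplete arrives (a sketch; the wording and helper name are mine):

```java
// Sketch: the "trigger" message from point 4 -- a clientContent text turn
// sent once setupComplete arrives, so the model asks the first question.
public class InterviewTrigger {
    static String buildTrigger(String text) {
        return "{\"clientContent\":{\"turns\":[{\"role\":\"user\","
             + "\"parts\":[{\"text\":\"" + text + "\"}]}],"
             + "\"turnComplete\":true}}";
    }
}
```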

"input_audio_transcription": {"languageCode:" "en-IN"}
That was a typo in my post; my actual JSON is correct.
response_modalities ["TEXT", "AUDIO"] is not supported; I have attached a reference image.

Language Enforcement: is there any option to set the language? When the candidate speaks, I need the transcription in English. When the candidate says "Hello, how are you?", Gemini returns the transcript in another language. I have already enforced this in the system prompt; refer to the prompt image below.
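On the interruption point (5), the client-side part I understand as follows: the moment the server flags an interruption, I must drop whatever AI audio is still queued for playback. A sketch (the queue here stands in for my real audio player's buffer; in the real API the signal arrives as an `interrupted` flag inside serverContent):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of client-side barge-in handling: queued AI audio is dropped
// as soon as the server reports the candidate started speaking.
public class PlaybackBuffer {
    private final Deque<byte[]> queue = new ArrayDeque<>();

    synchronized void enqueue(byte[] audioChunk) { queue.addLast(audioChunk); }

    // Called when a serverContent message with "interrupted": true arrives.
    synchronized void onInterrupted() { queue.clear(); }

    synchronized int pending() { return queue.size(); }
}
```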