Gemini native audio model gemini-live-2.5-flash-preview speaking issue (pause handling)

Hello everyone,

I am currently developing an AI-based IELTS instructor project. For the “Speaking” module with an AI voice assistant, I am using gemini-live-2.5-flash-preview through Vertex AI Studio with OAuth 2.0 authentication.

I am facing two major implementation issues that I need assistance with:

1. Voice Input Truncation (Aggressive End-of-Turn Detection)

During real-time audio streaming, the model appears to have very aggressive Voice Activity Detection (VAD). It stops listening whenever I take even a small, natural pause. It treats every micro-pause as the end of the user’s turn.

The symptoms:

  • The model sends a premature response while the user is still speaking.

  • The model’s audio output conflicts with the user’s ongoing speech.

  • Full sentences or ideas are not captured because chunks are cut off too early.

I need the model to maintain the listening state until the user intentionally stops speaking, rather than auto-cutting after micro-pauses. Are there specific parameters (e.g., silence_timeout or server-side VAD settings) to handle continuous speech better?

2. Speaking Band Score Bias (Capped at 5.5–6.5)

I am asking the model to evaluate the user’s speaking performance, but the output seems heavily constrained. No matter the quality of the input, the model consistently generates a band score in the range of 5.5 to 6.5.

This pattern persists regardless of the input:

  • Speech with errors: Scores ~6.0

  • Speech with excellent fluency/accuracy: Scores ~6.0

  • Extremely short or long answers: Scores ~6.0

This suggests the scoring behavior might be capped or biased by the model’s safety/prediction guardrails. Is there a recommended prompt structure or temperature configuration to ensure the model utilizes the full scoring range (0–9) dynamically?

Request for Guidance

I would appreciate help with the following:

  • How to properly configure continuous audio streaming so natural pauses do not terminate input.

  • How to adjust or improve the evaluation prompting so the scoring is realistic and not capped.

  • Any best practices or sample JSON configurations for real-time voice tasks with gemini-live-2.5-flash-preview.

Thank you in advance for your help!

Hi @Mehedi_Hasan_Shihab
You are currently using an older model, gemini-live-2.5-flash-preview, which is scheduled for deprecation. The best practice would be to migrate to the newer **gemini-2.5-flash-native-audio-preview-09-2025** model.
To answer your questions:

  1. To improve the VAD behavior, use the realtime_input_config.automatic_activity_detection field in your session setup message to gain fine-grained control over how the server determines the end of the user’s turn.
    By default, end-of-speech sensitivity is set to “High”, which is why small pauses are terminating the turn. You can explicitly set it to a lower sensitivity and lengthen the silence duration the server waits before closing the turn.
  2. To improve evaluation prompting, use a robust System Instruction that clearly defines the model’s role and the required output format and, most importantly, provides explicit scoring criteria with calibration anchors across the full 0–9 band range, to counteract the model’s tendency to cluster scores around the middle.
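For point 1, here is a minimal sketch of a session setup message as a plain Python dict. The field names follow the Live API’s automatic_activity_detection configuration; the model path is abbreviated and the timing values are placeholders you should tune for your users:

```python
# Sketch of a Live API session setup message that relaxes server-side VAD.
# Timing values below are illustrative starting points, not recommendations.
setup_message = {
    "setup": {
        "model": "gemini-live-2.5-flash-preview",  # use your full model resource path
        "realtime_input_config": {
            "automatic_activity_detection": {
                # Keep server-side VAD enabled, but make it slower to end the turn.
                "disabled": False,
                "end_of_speech_sensitivity": "END_SENSITIVITY_LOW",
                # Wait longer before treating silence as end of turn.
                "silence_duration_ms": 1500,
                # Include a short window of audio before detected speech starts.
                "prefix_padding_ms": 300,
            }
        },
    }
}
```

If you are using the google-genai SDK instead of raw WebSocket messages, the same fields are available as typed config on the live connect call.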
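For point 2, a hedged example of such a System Instruction. The band anchors below are my own paraphrase for illustration, not official IELTS descriptors, and the JSON output schema is an assumption you can adapt:

```python
# Illustrative system instruction to push the model toward the full band range.
SYSTEM_INSTRUCTION = """\
You are a certified IELTS Speaking examiner. Score the candidate's response
on each of the four official criteria: Fluency and Coherence, Lexical
Resource, Grammatical Range and Accuracy, and Pronunciation, from 0 to 9.

Use the FULL band range. Calibration anchors (illustrative):
- Band 8-9: natural, fluent speech with only rare, minor errors.
- Band 6-7: generally effective communication with noticeable errors.
- Band 4-5: frequent errors and hesitation that impede communication.
- Band 0-3: very limited, off-topic, or unintelligible responses.

Return strict JSON: {"fluency": n, "lexical": n, "grammar": n,
"pronunciation": n, "overall": n, "justification": "..."}.
"""
```

Pairing anchors like these with a low temperature and a strict output schema usually spreads the scores more realistically than asking for a single number.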

Give these a try and let me know if you need any more help.
Thanks