Gemini native audio gemini-live-2.5-flash-preview: Speaking module issue (premature pause/turn detection)

Hello everyone,

I am currently developing an AI-based IELTS instructor project. For the “Speaking” module with an AI voice assistant, I am using gemini-live-2.5-flash-preview through Vertex AI Studio with OAuth 2.0 authentication.

I am facing two major implementation issues that I need assistance with:

1. Voice Input Truncation (Aggressive End-of-Turn Detection)

During real-time audio streaming, the model appears to have very aggressive Voice Activity Detection (VAD). It stops listening whenever I take even a small, natural pause. It treats every micro-pause as the end of the user’s turn.

The symptoms:

  • The model sends a premature response while the user is still speaking.

  • The model’s audio output conflicts with the user’s ongoing speech.

  • Full sentences or ideas are not captured because chunks are cut off too early.

I need the model to maintain the listening state until the user intentionally stops speaking, rather than auto-cutting after micro-pauses. Are there specific parameters (e.g., silence_timeout or server-side VAD settings) to handle continuous speech better?

2. Speaking Band Score Bias (Capped at 5.5–6.5)

I am asking the model to evaluate the user’s speaking performance, but the output seems heavily constrained. No matter the quality of the input, the model consistently generates a band score in the range of 5.5 to 6.5.

This pattern persists regardless of the input:

  • Speech with errors: Scores ~6.0

  • Speech with excellent fluency/accuracy: Scores ~6.0

  • Extremely short or long answers: Scores ~6.0

This suggests the scoring behavior might be capped or biased by the model’s safety/prediction guardrails. Is there a recommended prompt structure or temperature configuration to ensure the model utilizes the full scoring range (0–9) dynamically?

Request for Guidance

I would appreciate help with the following:

  • How to properly configure continuous audio streaming so natural pauses do not terminate input.

  • How to adjust or improve the evaluation prompting so the scoring is realistic and not capped.

  • Any best practices or sample JSON configurations for real-time voice tasks with gemini-live-2.5-flash-preview.

Thank you in advance for your help!

Hi @Mehedi_Hasan_Shihab
You are currently using an older model, gemini-live-2.5-flash-preview, which is scheduled for deprecation. The best practice would be to migrate to the newer **gemini-2.5-flash-native-audio-preview-09-2025** model.
To answer your questions:

  1. To tune the VAD, use the realtime_input_config.automatic_activity_detection field in your session setup message; it gives you fine-grained control over how the server determines the end of a user’s speech.
    By default, the VAD uses a “High” end-of-speech sensitivity, which is why small pauses are terminating the turn. You can explicitly set this to a lower sensitivity.
  2. To improve the evaluation prompting, use a robust system instruction that clearly defines the model’s role and the required output format and, most importantly, provides explicit scoring criteria with examples to counteract the bias toward mid-range scores.
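For the first point, here is a minimal sketch of a session setup message with less aggressive end-of-turn detection. Field names follow the Live API's documented camelCase wire format; the exact silenceDurationMs value is an assumption you should tune for your users' natural pause lengths.

```python
import json

# Sketch of a Live API session setup message (BidiGenerateContentSetup) with
# less aggressive end-of-turn detection. The silenceDurationMs value is an
# assumption; tune it for the pause lengths typical of IELTS-style answers.
setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
        "generationConfig": {"responseModalities": ["AUDIO"]},
        "realtimeInputConfig": {
            "automaticActivityDetection": {
                "disabled": False,  # keep server-side VAD enabled
                # Wait longer before deciding the user has stopped speaking.
                "endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
                # Require a longer run of silence before ending the turn.
                "silenceDurationMs": 1500,
            }
        },
    }
}

print(json.dumps(setup_message, indent=2))
```

Send this as the first message after opening the WebSocket session.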

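For the second point, a system instruction along these lines can anchor the full band range. The band anchors below are illustrative paraphrases I wrote for this sketch, not official IELTS descriptors.

```python
# Sketch of an evaluation system instruction for the Speaking module.
# The band anchors are illustrative paraphrases, not official IELTS text.
SYSTEM_INSTRUCTION = """\
You are a certified IELTS Speaking examiner. Score the candidate on the four
official criteria: Fluency and Coherence, Lexical Resource, Grammatical Range
and Accuracy, and Pronunciation.

Use the FULL band range from 0 to 9. Calibration anchors:
- Band 3: frequent long pauses; very limited vocabulary; mostly memorized phrases.
- Band 5: maintains flow with effort; repetition, self-correction, basic errors.
- Band 7: speaks at length without noticeable effort; only occasional errors.
- Band 9: fully natural, precise, effortless speech.

Respond with JSON only:
{"fluency": 0-9, "lexical": 0-9, "grammar": 0-9, "pronunciation": 0-9,
 "overall": 0-9, "justification": "2-3 sentences citing the anchors"}
Do not default to the 5.5-6.5 range; justify every score against the anchors.
"""
```

Passing explicit per-band anchors plus a structured output format tends to spread the scores, because the model must justify each score against a named anchor instead of falling back to a safe middle value.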
Try these changes and let me know if you need any more help.
Thanks

Hi @Pannaga_J,

Thank you for the detailed guidance and for pointing out the model deprecation notice.

I wanted to follow up and clarify that I have already implemented the suggested changes, but the core issue still persists.

What I’ve implemented so far

Model migration

  • Migrated from gemini-live-2.5-flash-preview to gemini-2.5-flash-native-audio-preview-09-2025.

VAD configuration

  • Explicitly configured realtime_input_config.automatic_activity_detection in the session setup.

  • Lowered the end-of-speech sensitivity from the default “High” to a lower value to prevent short pauses from terminating the turn.

  • Verified that the updated configuration is being sent correctly during session initialization.

System instruction / prompting

  • Added a robust system instruction clearly defining:

    • The model’s role

    • Expected output format

    • Explicit evaluation/scoring criteria with concrete examples to minimize internal bias

Issue still observed

Despite the above changes:

  • The model still prematurely terminates speech during short or natural pauses.

  • The behavior appears similar to the default high-sensitivity VAD, even when a lower sensitivity is configured.

  • In some cases, speech cutoff happens before the user finishes a complete sentence.

Clarification / questions

I’d appreciate some clarification on the following points:

  • Are there any additional VAD-related parameters (e.g., internal thresholds, silence duration limits, or buffering constraints) that override or interact with automatic_activity_detection?

  • Is the VAD behavior fully controlled server-side, or are there known limitations where sensitivity settings may not be strictly enforced?

  • Are there any known issues or expected behaviors with native audio preview models regarding turn termination that we should account for?

  • Would you recommend any best-practice configuration patterns (or sample session configs) for handling conversational speech with natural pauses?
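To frame the last question: one fallback I am considering is disabling automatic VAD entirely and marking turn boundaries explicitly from the client (push-to-talk style). A minimal sketch, assuming the documented activityStart/activityEnd fields of the realtimeInput wire message and a simplified message framing:

```python
import json

# Push-to-talk fallback sketch: server-side VAD disabled, turn boundaries
# signaled explicitly by the client. Field names follow the Live API wire
# format; the message framing and chunking here are simplified assumptions.
setup_message = {
    "setup": {
        "model": "models/gemini-2.5-flash-native-audio-preview-09-2025",
        "realtimeInputConfig": {
            "automaticActivityDetection": {"disabled": True}
        },
    }
}

def turn_messages(audio_chunks_b64):
    """Wrap one user turn: explicit start, the audio chunks, explicit end."""
    yield {"realtimeInput": {"activityStart": {}}}
    for chunk in audio_chunks_b64:
        yield {"realtimeInput": {
            "audio": {"data": chunk, "mimeType": "audio/pcm;rate=16000"}
        }}
    yield {"realtimeInput": {"activityEnd": {}}}

msgs = list(turn_messages(["<base64-pcm-chunk>"]))
print(json.dumps(msgs[0]))  # the first message opens the turn
```

This sidesteps server-side sensitivity entirely, at the cost of requiring an explicit user gesture (or client-side VAD) to end each turn, so I would prefer a working automatic_activity_detection configuration if one exists.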

Thanks again for your support, and I appreciate any additional guidance you can provide.

Best regards,
Mehedi Hasan Shihab