Hello everyone,
I am currently developing an AI-based IELTS instructor project. For the “Speaking” module with an AI voice assistant, I am using gemini-live-2.5-flash-preview through Vertex AI Studio with OAuth 2.0 authentication.
I am facing two major implementation issues that I need assistance with:
1. Voice Input Truncation (Aggressive End-of-Turn Detection)
During real-time audio streaming, the model's Voice Activity Detection (VAD) appears to be very aggressive: it treats even a short, natural pause as the end of my turn and stops listening.
The symptoms:
- The model sends a premature response while the user is still speaking.
- The model’s audio output conflicts with the user’s ongoing speech.
- Full sentences or ideas are not captured because chunks are cut off too early.
I need the model to maintain the listening state until the user intentionally stops speaking, rather than auto-cutting after micro-pauses. Are there specific parameters (e.g., silence_timeout or server-side VAD settings) to handle continuous speech better?
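To make the question concrete, this is roughly the setup message I imagine tuning. It is only a minimal sketch: the field names (realtimeInputConfig, automaticActivityDetection, endOfSpeechSensitivity, silenceDurationMs, prefixPaddingMs) follow the Gemini Live API reference as far as I understand it, whether gemini-live-2.5-flash-preview on Vertex AI honours them is an assumption, and the values are guesses rather than recommendations.

```python
import json


def build_setup_message(silence_duration_ms: int = 1500) -> dict:
    """Build a Live API setup message with a more patient end-of-turn detector.

    All field names below are assumptions to be checked against the current
    Live API reference; the numeric values are illustrative only.
    """
    return {
        "setup": {
            # Model name as used in this post; Vertex AI may expect a full
            # resource path here, which is not shown.
            "model": "gemini-live-2.5-flash-preview",
            "generationConfig": {"responseModalities": ["AUDIO"]},
            "realtimeInputConfig": {
                "automaticActivityDetection": {
                    # Keep server-side VAD on, but make it less eager.
                    "disabled": False,
                    "endOfSpeechSensitivity": "END_SENSITIVITY_LOW",
                    # Silence required before the turn is considered over.
                    "silenceDurationMs": silence_duration_ms,
                    # Speech required before a turn start is committed.
                    "prefixPaddingMs": 300,
                }
            },
        }
    }


print(json.dumps(build_setup_message(), indent=2))
```

If settings like these are supported, I would expect a larger silenceDurationMs to trade a slightly slower reply for fewer interruptions during natural pauses.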
2. Speaking Band Score Bias (Capped at 5.5–6.5)
I am asking the model to evaluate the user’s speaking performance, but the output seems heavily constrained. No matter the quality of the input, the model consistently generates a band score in the range of 5.5 to 6.5.
This pattern persists regardless of the input:
- Speech with errors: scores ~6.0
- Speech with excellent fluency/accuracy: scores ~6.0
- Extremely short or long answers: scores ~6.0
This suggests the scores might be capped or biased toward the middle of the scale by the model’s safety/prediction guardrails. Is there a recommended prompt structure or temperature configuration that makes the model use the full band range (0–9)?
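One structure I am considering is to ask for a separate band per criterion with quoted evidence, and then compute the overall band in code instead of letting the model pick a single number. Below is a minimal sketch under stated assumptions: the four keys are the public IELTS Speaking criteria, but the prompt wording, the JSON reply shape, and the round-down-to-half-band rule are my own choices, not an official method. Scoring from a transcript also cannot judge pronunciation reliably.

```python
import json
import math

# The four public IELTS Speaking criteria, used as JSON keys (naming is mine).
CRITERIA = (
    "fluency_and_coherence",
    "lexical_resource",
    "grammatical_range_and_accuracy",
    "pronunciation",
)


def build_scoring_prompt(transcript: str) -> str:
    """Build a rubric-style prompt that asks for evidence before each band."""
    keys = ", ".join(f'"{c}"' for c in CRITERIA)
    return (
        "You are an IELTS Speaking examiner. Score the transcript below.\n"
        "For each criterion, first quote evidence from the transcript, "
        "then give an integer band from 0 to 9. Use the whole range: "
        "weak answers must score low and strong answers must score high.\n"
        f"Reply with JSON only, using exactly these keys: {keys}. "
        'Each value is an object {"evidence": string, "band": integer}.\n\n'
        f"Transcript:\n{transcript}"
    )


def overall_band(reply_json: str) -> float:
    """Average the four criterion bands and round down to a half band.

    The round-down rule is an assumption; adjust it to the rule you trust.
    """
    data = json.loads(reply_json)
    bands = []
    for criterion in CRITERIA:
        band = data[criterion]["band"]
        if not isinstance(band, int) or not 0 <= band <= 9:
            raise ValueError(f"invalid band for {criterion}: {band!r}")
        bands.append(band)
    mean = sum(bands) / len(bands)
    return math.floor(mean * 2) / 2
```

Computing the overall band outside the model at least makes it visible whether the per-criterion scores themselves cluster around 6, or whether only the final number does.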
Request for Guidance
I would appreciate help with the following:
- How to properly configure continuous audio streaming so natural pauses do not terminate input.
- How to adjust or improve the evaluation prompting so the scoring is realistic and not capped.
- Any best practices or sample JSON configurations for real-time voice tasks with gemini-live-2.5-flash-preview.
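As a fallback, I could decide the end of turn on the client and only then tell the server the user has finished. The sketch below shows the kind of silence gate I have in mind, assuming 16-bit PCM frames of 20 ms each; the threshold and the 1500 ms window are hypothetical values, and whether the Live API accepts client-side activity signals when automatic detection is disabled is an assumption I would need confirmed.

```python
import array
import math


def rms(frame: bytes) -> float:
    """Root-mean-square level of a 16-bit PCM frame (native byte order)."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_turn_end(frames, frame_ms=20, silence_ms=1500, threshold=500.0):
    """Return the index of the frame that ends the turn, or None.

    The turn ends only after speech has been heard and is then followed by
    `silence_ms` of consecutive quiet frames, so short pauses are ignored.
    """
    needed = silence_ms // frame_ms  # quiet frames required to end the turn
    quiet_run = 0
    heard_speech = False
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            heard_speech = True
            quiet_run = 0  # any speech resets the silence counter
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None
```

With a gate like this, a 200 ms pause between clauses would not end the turn, while a deliberate stop of 1.5 seconds would.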
Thank you in advance for your help!