Inconsistent Response Behavior in gemini-2.5-flash-native-audio-preview-09-2025 Voicebot

Hi everyone,
I’m building a real-time Hebrew voicebot using the gemini-2.5-flash-native-audio-preview-09-2025 model, and I’m running into inconsistent behavior that I can’t fully explain.

The issue:
Sometimes the model simply doesn’t answer at all. The bot receives the audio input, but there’s no response from Gemini. After several attempts (sometimes 3–5 retries), it suddenly responds normally. Other times, the entire flow works perfectly from the first message, without any delays or failures.

What I’ve confirmed so far:
• The audio stream is being sent correctly
• The STT + request payload is valid
• No errors are returned from the API
• The problem is intermittent and unpredictable
• When it works, it works flawlessly
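
Until the root cause is identified, a client-side watchdog that resends a turn when no response arrives can at least mask the silent failures. Here's a minimal sketch; `send_turn` is a hypothetical wrapper (not part of any Gemini SDK) around your own Live API call that returns `None` when the stream goes silent past your response timeout:

```python
import time
from typing import Callable, Optional

def send_with_retry(
    send_turn: Callable[[], Optional[str]],
    max_retries: int = 5,
    backoff_s: float = 0.5,
) -> Optional[str]:
    """Retry one user turn until the model actually responds.

    send_turn: sends the turn and returns the model's reply, or None
    if no response arrived within the caller's own timeout window.
    """
    for attempt in range(1, max_retries + 1):
        reply = send_turn()
        if reply is not None:
            return reply
        # Silent failure: back off briefly, then resend the same turn.
        time.sleep(backoff_s * attempt)
    return None
```

The 3–5 retry count reported above suggests `max_retries=5` as a practical ceiling; anything beyond that is probably a session-level problem rather than a transient one.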

What I’m trying to understand:
• Is this a known issue with the current preview model?
• Are there recommended settings, timeouts, or event-handling mechanisms to improve stability?
• Could this be related to rate limits, streaming configuration, or model warm-up behavior?
• Is there any diagnostic logging I should enable to better understand the silent failures?


Hi!

I can fully back this up: we’re facing the same issue in our own telephony integration, and it’s getting really frustrating in production use.

On top of that, I’d like to add that this inconsistency extends to tool use as well. For example:

  • The model sometimes doesn’t execute a tool when it should (roughly 7 out of 10 expected tool calls are executed, and 3 out of 10 are ignored);
  • The model acknowledges to the user that it will invoke a tool, as described in the system instructions (e.g., “I’ll look into that, give me a brief moment please…”), but doesn’t call any tool afterwards; it simply remains silent until the user speaks again and triggers another inference;
  • The model makes a tool call correctly but then remains silent and never communicates the result to the user until the user speaks again (we have verified that we’re sending the tool-call results back to the model).

(Please note that all of the expected behavior described above, which the model fails to follow, is thoroughly and clearly specified in the system instructions.)
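
For the third failure mode (tool result delivered, model stays silent), one workaround is to track timestamps and, if no model audio arrives within a grace window after the tool response, inject a short text turn to re-trigger inference. The window length and nudge text below are assumptions, and the `send_client_content` wiring is shown only as a comment since the exact payload shape should be checked against the current SDK docs:

```python
def needs_nudge(tool_result_ts: float, last_audio_ts: float,
                now: float, window_s: float = 2.0) -> bool:
    """True if the model has produced no audio since the tool result
    was sent and the grace window has expired."""
    no_audio_since_result = last_audio_ts < tool_result_ts
    window_expired = (now - tool_result_ts) >= window_s
    return no_audio_since_result and window_expired

# Hypothetical wiring inside the receive loop:
#   if needs_nudge(tool_result_ts, last_audio_ts, time.monotonic()):
#       await session.send_client_content(
#           turns={"role": "user",
#                  "parts": [{"text": "Please report the tool result."}]}
#       )
```

This doesn't fix the underlying behavior, but it bounds the dead air the user experiences to roughly the window length.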

Lastly, we have tried adjusting the temperature and proactive-dialog settings, but the inconsistency persists and remains a significant issue.

Thanks to Google for paying attention to this matter.

Cheers


Hi,

I’m backing this too. We are currently running gemini-2.5-flash-live-preview in production with reasonable stability, while our tests with the native audio model show significant latency and voice-generation issues.

Unfortunately, that model will be deprecated on December 9, and we will be forced to move to the native-audio model, which doesn’t yet have the quality we expect. As far as I know, there won’t be any extension for gemini-2.5-flash-live-preview, and there haven’t been any updates to the native-09 preview model.


Hi!

We’ve been seeing the same behavior on our side as well. In our case, things became much more stable after disabling Gemini’s built-in VAD, switching to a manual VAD pipeline, and explicitly marking start_activity and end_activity boundaries with correct timing. Without tight control of the activity windows, we noticed the model would either stay silent or fail to resume properly after long user turns.
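
For reference, here is roughly what that wiring looks like with the google-genai Python SDK: disable automatic activity detection in the session config, then bracket each user turn with explicit activity markers. Treat this as a sketch against the current Live API docs rather than verified code, since field and type names may shift while the model is in preview:

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

MODEL = "gemini-2.5-flash-native-audio-preview-09-2025"

config = {
    "response_modalities": ["AUDIO"],
    # Turn off Gemini's built-in VAD so the client owns turn boundaries.
    "realtime_input_config": {
        "automatic_activity_detection": {"disabled": True},
    },
}

async def send_turn(pcm_chunks) -> bytes:
    """Send one user turn framed by explicit activity markers."""
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # Our manual VAD decided the user started speaking...
        await session.send_realtime_input(activity_start=types.ActivityStart())
        for chunk in pcm_chunks:  # raw 16-bit PCM at 16 kHz
            await session.send_realtime_input(
                audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
            )
        # ...and that the turn ended, so the model knows it should respond.
        await session.send_realtime_input(activity_end=types.ActivityEnd())

        audio_out = bytearray()
        async for msg in session.receive():
            if msg.data:  # audio bytes from the model
                audio_out.extend(msg.data)
        return bytes(audio_out)
```

With auto-VAD disabled the model should only start responding once it sees the end-activity marker, which is exactly the tight control of activity windows described above.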

On a related note, I’ll leave here a post + repo where I documented the limitations we found around long live sessions, context growth, and how Gemini behaves when the session gets too long (including cases where responses stop arriving or the model interrupts itself). It might help others facing the same symptoms until the preview model becomes more stable.

NOTE: This is a temporary solution intended to stabilize production apps until the core “ghosting” issues in the native-audio stream are patched.

Hi everyone,

I’ve been testing a workaround for the native-audio-preview instability, and I can confirm that moving the VAD (Voice Activity Detection) to the client-side resolves the “ghosting” and tool-use failures.

The Fix: Instead of relying on the Live API’s open stream to detect “End of Turn” (which seems to be the point of failure), I implemented a client-side buffer.

  1. Local VAD: App listens to the mic → Detects Silence → Cuts the recording.

  2. Batch Send: Uploads the audio buffer as a single input to the gemini-2.5-flash-native-audio model.
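
The two steps above can be sketched as a simple energy-based recorder. The RMS threshold and frame sizes here are illustrative (a real app might use a dedicated VAD library such as webrtcvad instead):

```python
import math
from array import array

def rms(frame: array) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def record_until_silence(frames, silence_threshold=500.0, silence_frames=30):
    """Buffer mic frames until `silence_frames` consecutive quiet frames
    are seen, then return the clipped recording as one buffer.

    frames: iterable of array('h') chunks, e.g. 20 ms each, so 30 quiet
    frames is roughly a 0.6 s silence timeout.
    """
    buffered, quiet = [], 0
    for frame in frames:
        buffered.append(frame)
        quiet = quiet + 1 if rms(frame) < silence_threshold else 0
        if quiet >= silence_frames:
            # Drop the trailing silence before the batch upload.
            del buffered[-silence_frames:]
            break
    recording = array("h")
    for frame in buffered:
        recording.extend(frame)
    return recording  # upload as a single audio input to the model
```

The returned buffer is then sent as one complete input (step 2), so the server never has to guess where the turn ends.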

Test Results:

• Success Rate: 100% response (No ghosting).

• Baseline Latency: ~3.5s (average round-trip on default settings).

• Accuracy: The model correctly identifies raw audio nuances because it receives the full context at once.

Optimization Possibility (~1.75s - 2s): My initial tests were conservative. You can likely get the latency down to under 2 seconds if you:

• Force 16 kHz Sampling: The Flash model is optimized for 16 kHz audio. Recording at this rate reduces the payload size by ~33% compared to 24 kHz (and by roughly 64% compared to 44.1 kHz), speeding up the upload significantly.

• Aggressive VAD: Tightening the silence timeout (e.g., from 1.5s down to 0.6s) makes the “turn” feel much snappier.
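
As a sanity check on the sampling-rate claim, the payload math for uncompressed 16-bit mono PCM works out like this:

```python
def pcm_bytes_per_second(sample_rate_hz: int, bytes_per_sample: int = 2,
                         channels: int = 1) -> int:
    """Raw payload rate of uncompressed PCM audio."""
    return sample_rate_hz * bytes_per_sample * channels

b16 = pcm_bytes_per_second(16_000)  # 32,000 B/s at 16 kHz
b24 = pcm_bytes_per_second(24_000)  # 48,000 B/s at 24 kHz
reduction = 1 - b16 / b24           # the ~33% savings versus 24 kHz
```

Against 44.1 kHz the same formula gives roughly a 64% reduction, so downsampling before upload is the single cheapest latency win in this batch approach.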

Why do this? While this introduces a small delay compared to true streaming, it is infinitely better than the “3-5 retries” or total silence currently plaguing the streaming endpoint. It forces the model to treat the input as a complete thought, which drastically improves tool-calling reliability.

If you are stuck before the Dec 9th deprecation, try buffering your audio locally and sending it as a “committed” turn until the devs fix the main cause.

Hope it helps, ~Cmd.Proton

Has anyone tried whether Provisioned Throughput through Vertex AI fixes this problem?