The new “Native Audio” model is worse at tool calling than the much better gemini-live-2.5-flash-preview model. The old model could complete function/tool calls 90-100% of the time, whereas the new model struggles most of the time and hallucinates, claiming it used the function call when in fact it did not. I see no logical reason to remove the older model (which is not much older) and leave users with a “newer” model that doesn’t even support a text-only output modality, only audio output. It’s just overall worse at the moment, until it’s refined, especially considering how many people’s services depended on the older model because it was more reliable and stable.
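For anyone who wants to reproduce the comparison, here is a minimal sketch of how we exercise tool calling over the Live API using the google-genai Python SDK. The get_weather declaration is a made-up example, exact field names may differ between SDK versions, and the native-audio model only accepts AUDIO as the response modality, which is part of the problem:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Hypothetical tool used only to exercise function calling.
get_weather = types.FunctionDeclaration(
    name="get_weather",
    description="Look up the current weather for a city.",
    parameters=types.Schema(
        type=types.Type.OBJECT,
        properties={"city": types.Schema(type=types.Type.STRING)},
        required=["city"],
    ),
)

config = types.LiveConnectConfig(
    response_modalities=["TEXT"],  # text-only output, which the older model supported
    tools=[types.Tool(function_declarations=[get_weather])],
)

async def main() -> None:
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-preview",  # swap in the native-audio model to compare
        config=config,
    ) as session:
        await session.send_client_content(
            turns=types.Content(
                role="user",
                parts=[types.Part(text="What's the weather in Tokyo right now?")],
            )
        )
        async for message in session.receive():
            # A well-behaved model emits an explicit tool_call message rather than
            # claiming in prose that it already called the function.
            if message.tool_call:
                for fc in message.tool_call.function_calls:
                    print("tool requested:", fc.name, fc.args)
            elif message.text:
                print("text:", message.text)

asyncio.run(main())
```

With the older model the tool_call message shows up almost every time; with the native-audio model we mostly get prose claiming the call already happened.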
We appreciate you taking the time to share your thoughts with us; your feedback is invaluable as we work to continuously improve the Gemini API experience.
Request: Please Reactivate gemini-live-2.5-flash-preview as a GA Model
The deprecation of gemini-live-2.5-flash-preview has significantly impacted our production workflows. This model was unmatched for real-time transcription and translation use cases.
Why gemini-live-2.5-flash-preview was essential:
- Superior speed for live audio processing
- Excellent quality for simple text outputs
- Cost-effective for production workloads
- Ideal for the Live API’s streaming capabilities
Issues with gemini-2.5-flash-native-audio-preview:
- Noticeably slower response times
- Degraded performance for straightforward text transcription
- Higher costs that make it unsuitable for many production applications
Our request: Please consider reactivating gemini-live-2.5-flash-preview alongside the newer model. Many developers in the community are facing the same challenges. Since the infrastructure already exists, offering both options would give developers the flexibility to choose the right tool for their specific use cases without forcing migration to a model that doesn’t meet their requirements.
Hey!
Just wanted to upvote this comment.
I really don’t understand why the support for text output was removed… This may be subjective, but from my point of view, speech-to-speech models still have to mature a bit more before they can be used in production (at least in the context I’m working in). The half-cascaded architecture that Google used to describe in their Live API documentation was, for me, the current sweet spot. I didn’t need to wire up separate models (VAD + ASR + LLM + TTS) because the model would handle the VAD + ASR + LLM. The experience was really smooth, and I could better control the output audio and even use my own cloned voice.
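For context, here is a rough sketch of the half-cascaded flow I mean, using the google-genai Python SDK. synthesize_with_cloned_voice is a hypothetical placeholder for whatever TTS service hosts your cloned voice, and exact SDK names may vary by version:

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

def synthesize_with_cloned_voice(text: str) -> bytes:
    """Hypothetical hook: call whichever TTS service hosts your cloned voice."""
    raise NotImplementedError

async def half_cascaded_turn(user_text: str) -> bytes:
    # The Live model handles turn detection, transcription, and the LLM step;
    # text comes back to us, so speech synthesis stays fully under our control.
    config = types.LiveConnectConfig(response_modalities=["TEXT"])
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-preview",  # the deprecated half-cascaded model
        config=config,
    ) as session:
        await session.send_client_content(
            turns=types.Content(role="user", parts=[types.Part(text=user_text)])
        )
        reply = ""
        async for message in session.receive():
            if message.text:
                reply += message.text
    return synthesize_with_cloned_voice(reply)

# Example: audio = asyncio.run(half_cascaded_turn("Summarize today's agenda."))
```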
I’ll probably switch to OpenAI’s models or even to a service like Ultravox that does just that.
Also, just a couple of months ago, Google stated in their documentation that “It [half-cascaded architecture] offers better performance and reliability in production environments, especially with tool use.” when comparing half-cascaded audio with native audio. I couldn’t find any information about what changed; as far as I can tell, the claim was simply removed from the documentation.
Any updates on bringing back a Half Cascade model?