Hello Google AI team,
After upgrading from Gemini 2.5 to Gemini 3.1 Flash in the Live API (WebSocket), I noticed that inputTranscription is no longer delivered incrementally; it now arrives only after the user finishes the full utterance.
For example, Gemini Live 2.5 delivered transcription like this:
{“serverContent”: {“inputTranscription”: {“text”: " Ho"}}}
{“serverContent”: {“inputTranscription”: {“text”: “w”}}}
{“serverContent”: {“inputTranscription”: {“text”: " are"}}}
{“serverContent”: {“inputTranscription”: {“text”: " you"}}}
This change breaks my app: it records audio in short bursts and relies on incremental inputTranscription updates to detect whether the user is still speaking, so it can extend the recording window. Without these real-time chunks, recordings often cut off prematurely.
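In case it helps illustrate the dependency, here is a minimal sketch of how my client extends its recording deadline on each incremental chunk (the function name, `EXTEND_SECS`, and the deadline logic are my own, not part of the Live API; only the `serverContent.inputTranscription.text` shape comes from the actual messages):

```python
import json

# Assumption: how long to keep recording after the last transcription chunk.
EXTEND_SECS = 2.0

def updated_deadline(message: str, deadline: float, now: float) -> float:
    """Return a new recording cutoff time based on one server message.

    If the message carries an incremental inputTranscription chunk, the
    user is presumably still speaking, so push the cutoff forward;
    otherwise leave the deadline unchanged.
    """
    event = json.loads(message)
    transcription = event.get("serverContent", {}).get("inputTranscription")
    if transcription and transcription.get("text"):
        return max(deadline, now + EXTEND_SECS)
    return deadline
```

With 3.1 Flash, the first call to this function now happens only after the utterance is already complete, which is too late to extend the window.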
For comparison, the OpenAI Realtime API provides input_audio_buffer.speech_started and speech_stopped events for this exact purpose. However, I couldn’t find similar indicators in Gemini 3.1 Flash Live.
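The only workaround I can think of right now is doing crude speech detection on my side before the audio is even sent, e.g. an energy-based check on the raw 16-bit PCM. A rough sketch of what I mean (the threshold is a guess and would need tuning per microphone/gain; this is not something the Live API provides):

```python
import math
import struct

# Assumption: RMS level above which a chunk counts as speech; tune per setup.
RMS_THRESHOLD = 500

def is_speech(pcm16: bytes, threshold: int = RMS_THRESHOLD) -> bool:
    """Crude energy-based voice-activity check on 16-bit little-endian mono PCM."""
    if not pcm16:
        return False
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold
```

But this is fragile compared to server-side VAD signals, which is why I'd prefer an official indicator.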
How would you recommend handling this change? Is there a new way to detect ongoing speech?
Thanks!