I created a small public repo to reproduce and document a behavior I identified while testing the Gemini Live API. When streaming audio continuously in a long-running session, the input_audio_transcription output tends to degrade and eventually stops arriving.
However, when sending the full audio at once (non-streaming) at the end of the user’s speech, the transcriptions remain accurate. This is consistent with what I’m also seeing in AI Studio, so I assume this may be expected behavior in the current API implementation.
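For context, the two send patterns I’m comparing look roughly like this (a sketch assuming the google-genai Python SDK; `mic_chunks` and the audio format are placeholders, not values from my repo):

```python
from google.genai import types

# Pattern 1: continuous streaming — transcription degrades over long sessions.
async def stream_audio(session, mic_chunks):
    async for chunk in mic_chunks:  # e.g. short PCM frames from the microphone
        await session.send_realtime_input(
            audio=types.Blob(data=chunk, mime_type="audio/pcm;rate=16000")
        )

# Pattern 2: one shot at end-of-speech — transcription stays accurate.
async def send_buffered_audio(session, full_utterance: bytes):
    await session.send_realtime_input(
        audio=types.Blob(data=full_utterance, mime_type="audio/pcm;rate=16000")
    )
```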
Because of this, I implemented a simple workaround in a workshop project:
every X seconds I perform a “flush” by sending an activity_end followed immediately by a new activity_start. This effectively closes the turn while preventing the agent from speaking, so the user does not notice any interruption. In practice it resets the transcription pipeline and avoids the degradation during long sessions. A rough sketch is below.
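This is roughly what the flush loop looks like (a minimal sketch using the google-genai Python SDK with automatic activity detection disabled, which is required for manual activity signals; the model name and FLUSH_INTERVAL are placeholder assumptions, not values from the repo):

```python
import asyncio
from google import genai
from google.genai import types

client = genai.Client()  # reads GOOGLE_API_KEY from the environment

# Manual activity_start/activity_end signals require disabling automatic VAD.
CONFIG = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
    realtime_input_config=types.RealtimeInputConfig(
        automatic_activity_detection=types.AutomaticActivityDetection(disabled=True)
    ),
)

FLUSH_INTERVAL = 10  # seconds between flushes; placeholder, tune for your sessions


async def periodic_flush(session):
    # Close the current activity and immediately open a new one. This ends
    # the turn without giving the agent a chance to start speaking, which
    # in my tests resets the transcription pipeline.
    while True:
        await asyncio.sleep(FLUSH_INTERVAL)
        await session.send_realtime_input(activity_end=types.ActivityEnd())
        await session.send_realtime_input(activity_start=types.ActivityStart())


async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",  # placeholder Live model name
        config=CONFIG,
    ) as session:
        # With automatic detection off, we must open the first activity ourselves.
        await session.send_realtime_input(activity_start=types.ActivityStart())
        flusher = asyncio.create_task(periodic_flush(session))
        try:
            ...  # stream mic audio with session.send_realtime_input(audio=...)
        finally:
            flusher.cancel()


asyncio.run(main())
```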
Maybe this approach can help others until the issue is addressed on Google’s side.
Inside the repo I included two files with more details:
- ARCHITECTURE.md – a description of the current architecture used for the tests
- TRANSCRIPTION_LIMITS.md – a collection of the transcription limitations I found, comparing continuous streaming vs. sending all audio at end-of-speech
Happy to share the repo if anyone is interested or if it helps the team reproduce the behavior. And again, thank you for the welcome!