Hi everyone,
I’m building a voice agent using the Gemini Live API
(gemini-live-2.5-flash-native-audio) and noticed a confusing
behavior with input_audio_transcription.
Setup:
- LiveConnectConfig with input_audio_transcription=AudioTranscriptionConfig()
- response_modalities=["AUDIO"]
- Listening to server_content.input_transcription.text for displaying
user speech in the UI
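For context, here is roughly what that setup looks like with the google-genai Python SDK. This is a minimal sketch, not my exact code: the audio-sending side is elided, and the model string is just the one mentioned above.

```python
import asyncio

from google import genai
from google.genai import types

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

MODEL = "gemini-live-2.5-flash-native-audio"  # model name as used above

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    # Request a transcript of the user's input audio
    input_audio_transcription=types.AudioTranscriptionConfig(),
)

async def run():
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        # ... microphone audio is streamed to the session here ...
        async for msg in session.receive():
            sc = msg.server_content
            if sc and sc.input_transcription and sc.input_transcription.text:
                # This is the text I display as "what the user said"
                print("user:", sc.input_transcription.text)

asyncio.run(run())
```

The mismatch happens with the text printed in that last branch: the UI shows the `input_transcription` text, while the model's spoken reply reflects a different (correct) understanding of the same audio.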
Issue:
The transcription text does not match what the model actually understood.
Example:
- I said: “My wifi doesn’t work”
- input_transcription.text returned: “My wife isn’t well”
- But the model responded correctly to “My wifi doesn’t work”
So the model understood me perfectly, but the transcription shown
was completely wrong — which is misleading to users.
Questions:
- Is input_audio_transcription a separate ASR model from the
main Gemini model? Is this a known accuracy gap?
- Is there a way to get the transcription of what the model
actually heard/processed?
- Has anyone else experienced this? Any workarounds?
Thanks!