Gemini Live API: input_audio_transcription returns incorrect text while model correctly processes audio

Hi everyone,

I’m building a voice agent using the Gemini Live API
(gemini-live-2.5-flash-native-audio) and noticed a confusing
behavior with input_audio_transcription.

Setup:

  • LiveConnectConfig with input_audio_transcription=AudioTranscriptionConfig()
  • response_modalities=["AUDIO"]
  • Listening to server_content.input_transcription.text for displaying
    user speech in the UI
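
For context, here is a minimal sketch of that setup, assuming the google-genai Python SDK; the audio-send plumbing is elided and error handling is omitted, so treat it as illustrative rather than complete:

```python
# Minimal sketch (google-genai Python SDK). Assumes GOOGLE_API_KEY is set
# in the environment; mic streaming is elided.
import asyncio
from google import genai
from google.genai import types

client = genai.Client()

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    input_audio_transcription=types.AudioTranscriptionConfig(),
)

async def main():
    async with client.aio.live.connect(
        model="gemini-live-2.5-flash-native-audio", config=config
    ) as session:
        # ... stream mic audio via session.send_realtime_input(...) ...
        async for message in session.receive():
            sc = message.server_content
            if sc and sc.input_transcription and sc.input_transcription.text:
                # This is the text that sometimes diverges from what
                # the model actually responds to.
                print("user said:", sc.input_transcription.text)

asyncio.run(main())
```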

Issue:
The transcription text does not match what the model actually understood.

Example:

  • I said: “My wifi doesn’t work”
  • input_transcription.text returned: “My wife isn’t well”
  • But the model responded correctly to “My wifi doesn’t work”

So the model understood me perfectly, but the transcription shown
was completely wrong — which is misleading to users.

Questions:

  1. Is input_audio_transcription a separate ASR model from the
    main Gemini model? Is this a known accuracy gap?
  2. Is there a way to get the transcription of what the model
    actually heard/processed?
  3. Has anyone else experienced this? Any workarounds?

Thanks!