Live API get output transcription timestamp

Hi,

I’m using Gemini Live in all my projects due to its quality and speed.

For one of my projects, I have a system prompt asking Gemini to answer my question while also including an audio description part like “a man is looking through the window”. I need the response modality to be audio plus text, so I use the response modality “AUDIO” together with “output_audio_transcription”.
Next, I need the user to hear only the part of the audio containing the AI’s response to the input, with the audio description part removed. I use the transcription to extract the audio description for further processing.
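For context, the setup described above looks roughly like this with the google-genai Python SDK (a sketch of the configuration only, not a complete program; it assumes `GEMINI_API_KEY` is set in the environment):

```python
from google import genai
from google.genai import types

# Sketch (not a complete program): request audio output plus a text
# transcription of that audio in a Live API session.
client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

# The session is then opened with something like:
#   async with client.aio.live.connect(
#       model="gemini-2.0-flash-live-001", config=config
#   ) as session: ...
```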

Unfortunately, the output transcription chunks are not aligned with the audio chunks, and there is no timestamp. For now, I use Whisper to produce a transcription with a timestamp for each word, but it slows down the process and removes the real-time aspect of using Gemini Live.
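To illustrate the current workaround, here is a minimal sketch of the post-processing step: given Whisper-style `(word, start, end)` tuples and the Live API’s 24 kHz, 16-bit mono PCM output audio, it keeps only the spoken-answer words. The bracket convention for marking description words is a hypothetical example, not part of any API:

```python
# Sketch of the Whisper-based workaround: use per-word timestamps to cut
# the audio-description words out of the raw output audio. Assumes the
# description part is delimited by "[" and "]" in the transcript (a
# hypothetical convention for this example).

SAMPLE_RATE = 24000   # Live API output audio: 24 kHz, 16-bit mono PCM
BYTES_PER_SAMPLE = 2

def slice_pcm(pcm: bytes, start_s: float, end_s: float) -> bytes:
    """Return the PCM bytes between two timestamps (in seconds)."""
    start = int(start_s * SAMPLE_RATE) * BYTES_PER_SAMPLE
    end = int(end_s * SAMPLE_RATE) * BYTES_PER_SAMPLE
    return pcm[start:end]

def keep_answer_audio(pcm: bytes, words: list[tuple[str, float, float]]) -> bytes:
    """Concatenate the audio of words outside bracketed description spans."""
    out = bytearray()
    in_description = False
    for word, start, end in words:
        if word.startswith("["):
            in_description = True
        if not in_description:
            out += slice_pcm(pcm, start, end)
        if word.endswith("]"):
            in_description = False
    return bytes(out)
```

With native timestamps in “output_audio_transcription”, the Whisper pass above would be unnecessary and the slicing could happen in real time as chunks arrive.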

Do you think it would be possible to add “start” and “end” timestamps, aligned with the output audio, directly in “output_audio_transcription”?
It would be a very helpful feature, and I don’t think it would be too complicated to implement.

Hello

Welcome to the Forum,

To understand your issue better, could you please share some more details: which model are you using, are you facing this issue with the Gemini API or AI Studio, and which framework are you using?

Hi,
Thank you.

First, this is more of a feature request than an issue. I’m using “gemini-2.0-flash-live-001”, but the request applies to all the flash-live models, such as “gemini-2.5-flash-preview-native-audio-dialog”, as well.
And I need this feature in the Gemini API, alongside the “output_audio_transcription” feature.
Here is an example of the API output with output transcription enabled:

setup_complete=None server_content=LiveServerContent(model_turn=None, turn_complete=None, interrupted=None, grounding_metadata=None, generation_complete=None, input_transcription=None, output_transcription=Transcription(text='Je ', finished=None), url_context_metadata=None) tool_call=None tool_call_cancellation=None usage_metadata=None go_away=None session_resumption_update=None

The goal is to have the output transcription part “output_transcription=Transcription(text='Je ', finished=None)” aligned with per-word timestamps, like so:
=> “output_transcription=Transcription(text='Je ', timestamp: (0.44, 0.52), finished=None)” (a tuple, like Whisper turbo)
or
=> “output_transcription=Transcription(text='Je ', start=0.44, end=0.52, finished=None)”

With “start” and “end” representing the start and end of the word in the speech output.