Live API get output transcription timestamp

Hi,

I’m using Gemini Live in all my projects due to its quality and speed.

For one of my projects, I have a system prompt asking Gemini to answer my question while also including an audio description part like “a man is looking through the window”. I need the response modality to be audio plus text, so I use the response modality “AUDIO” together with “output_audio_transcription”.
Next, I need the user to hear only the part of the audio containing the AI’s response to the input, with the audio description part removed. I use the transcription to extract the audio description for further processing.
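For context, the setup described above looks roughly like this with the google-genai Python SDK (a sketch of the configuration only, not a complete program; it assumes `GEMINI_API_KEY` is set in the environment):

```python
from google import genai
from google.genai import types

# Sketch (not a complete program): request audio output plus a text
# transcription of that audio in a Live API session.
client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

# The session is then opened with something like:
#   async with client.aio.live.connect(
#       model="gemini-2.0-flash-live-001", config=config
#   ) as session: ...
```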

Unfortunately, the output transcription chunks are not aligned with the audio chunks, and there is no timestamp. For now, I use Whisper to produce a transcription with a timestamp for each word, but it slows down the process and removes the real-time aspect of using Gemini Live.
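To illustrate the current workaround, here is a minimal sketch of the post-processing step: given Whisper-style `(word, start, end)` tuples and the Live API’s 24 kHz, 16-bit mono PCM output audio, it keeps only the spoken-answer words. The bracket convention for marking description words is a hypothetical example, not part of any API:

```python
# Sketch of the Whisper-based workaround: use per-word timestamps to cut
# the audio-description words out of the raw output audio. Assumes the
# description part is delimited by "[" and "]" in the transcript (a
# hypothetical convention for this example).

SAMPLE_RATE = 24000   # Live API output audio: 24 kHz, 16-bit mono PCM
BYTES_PER_SAMPLE = 2

def slice_pcm(pcm: bytes, start_s: float, end_s: float) -> bytes:
    """Return the PCM bytes between two timestamps (in seconds)."""
    start = int(start_s * SAMPLE_RATE) * BYTES_PER_SAMPLE
    end = int(end_s * SAMPLE_RATE) * BYTES_PER_SAMPLE
    return pcm[start:end]

def keep_answer_audio(pcm: bytes, words: list[tuple[str, float, float]]) -> bytes:
    """Concatenate the audio of words outside bracketed description spans."""
    out = bytearray()
    in_description = False
    for word, start, end in words:
        if word.startswith("["):
            in_description = True
        if not in_description:
            out += slice_pcm(pcm, start, end)
        if word.endswith("]"):
            in_description = False
    return bytes(out)
```

With native timestamps in “output_audio_transcription”, the Whisper pass above would be unnecessary and the slicing could happen in real time as chunks arrive.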

Do you think it would be possible to add “start” and “end” timestamps, aligned with the output audio, directly in “output_audio_transcription”?
It would be a very helpful feature, and I don’t think it would be too complicated to implement.

Hello

Welcome to the Forum,

To understand your issue better, could you please share some more details: which model are you using, are you facing this issue with the Gemini API or AI Studio, and which framework are you using?

Hi,
Thank you.

First, this is more of a feature request than an issue. I’m using “gemini-2.0-flash-live-001”, but the request applies to all the flash-live models, such as “gemini-2.5-flash-preview-native-audio-dialog”, as well.
And I need this feature in the Gemini API, alongside the “output_audio_transcription” feature.
Here is an example of the API output with output transcription enabled:

setup_complete=None server_content=LiveServerContent(model_turn=None, turn_complete=None, interrupted=None, grounding_metadata=None, generation_complete=None, input_transcription=None, output_transcription=Transcription(text='Je ', finished=None), url_context_metadata=None) tool_call=None tool_call_cancellation=None usage_metadata=None go_away=None session_resumption_update=None

The goal is to have the output transcription part “output_transcription=Transcription(text='Je ', finished=None)” aligned with per-word timestamps, like so:
=> “output_transcription=Transcription(text='Je ', timestamp: (0.44, 0.52), finished=None)” (a tuple, like Whisper turbo)
or
=> “output_transcription=Transcription(text='Je ', start=0.44, end=0.52, finished=None)”

With “start” and “end” representing the start and end of the word in the speech output.