I’m trying to add transcription to the Gemini Live demo code here, following Google’s official guide: Live API capabilities guide | Gemini API | Google AI for Developers
But the transcription is a mess, like below. Am I missing anything? Any extra flags to set?
[Model Transcript]: Ca
[Model Transcript]: n I
[Model Transcript]: pl
[Model Transcript]: eas
[Model Transcript]: e h
[Model Transcript]: ave
[Model Transcript]: yo
[Model Transcript]: ur
[Model Transcript]: acc
[Model Transcript]: oun
[Model Transcript]: t n
[Model Transcript]: umb
[Model Transcript]: er
The behavior you’re seeing is expected for a streaming API. To provide real-time feedback, the API sends back interim (partial) transcripts as it processes the audio. You are printing every one of these partial results.
To fix this, you need to filter the responses and only use the transcript when the API flags it as final.
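In pseudocode, the pattern looks roughly like this (a minimal sketch; chunk.text and chunk.is_final are illustrative names, not the actual SDK attributes):

buffer = ""
for chunk in stream:
    if chunk.text:          # interim (partial) transcript fragment
        buffer += chunk.text
    if chunk.is_final:      # flush only once the API marks the result final
        print(buffer)
        buffer = ""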
Ok, how do I mark it as final?
Yeah it’s useless right now. We just need to print it when it ends
Based on your request, it seems you want to “mark” the previous response as “final” or indicate that the conversation has concluded, and you also mention “it’s useless right now. We just need to print it when it ends.”
Could you please clarify what you mean by “mark it as final”? Are you:
- Requesting a specific output format? For example, you want me to add a phrase like “[END OF RESPONSE]” or “[FINAL]” to the end of my replies.
- Trying to end the current conversation? In this case, you can simply stop asking questions.
- Referring to a feature or command for a specific application or process? If so, please provide more context about the application you are using.
The phrase “We just need to print it when it ends” suggests you might be part of a larger process where my response is an interim step, and the final output is generated later.
Please provide more detail about what you are trying to achieve so I can give you a more accurate and helpful response.
This is what I mean. How can we do that?
Hello
Welcome to the forum!!
I ran into this too. Since the API streams the text bit-by-bit for speed, you just need to buffer those fragments in a variable and only print the result when the API sends the turn_complete signal.
Here is the code snippet:
import asyncio
import google.genai as genai
from google.colab import userdata

# Initialize the client (the Live API currently requires the v1alpha endpoint)
client = genai.Client(
    api_key=userdata.get("your Key"),  # replace "your Key" with your Colab secret name
    http_options={"api_version": "v1alpha"},
)

async def main():
    model_id = "gemini-2.0-flash-exp"
    # output_audio_transcription is required to receive text chunks
    config = {"response_modalities": ["AUDIO"], "output_audio_transcription": {}}

    async with client.aio.live.connect(model=model_id, config=config) as session:
        # Send a prompt to trigger audio
        await session.send(input="Can I please have your account number?", end_of_turn=True)

        full_transcript = ""
        async for response in session.receive():
            server_content = response.server_content
            if server_content is None:
                continue

            # 1. Accumulate text chunks silently (don't print partials)
            if server_content.output_transcription:
                full_transcript += server_content.output_transcription.text

            # 2. Print ONLY when the turn is complete
            if server_content.turn_complete:
                print(f"Final Transcript: {full_transcript}")
                full_transcript = ""

await main()  # top-level await works in Colab/Jupyter
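One note on the last line: await main() only works where an event loop is already running, such as Colab or Jupyter. In a plain Python script, run the coroutine like this instead:

asyncio.run(main())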
Thanks