Hi everyone,
I want to use the new realtime (Live) API on the Gemini API, but I am unable to receive text AND audio, similar to how it works in Google AI Studio. Is this not possible at the moment? Currently, I am using this code:
import asyncio
import base64
import contextlib
import datetime
import itertools
import json
import os
import wave

from IPython.display import display, Audio
from google import genai
from google.genai import types

os.environ['GOOGLE_API_KEY'] = "..."

client = genai.Client(http_options={'api_version': 'v1alpha'})
MODEL = "gemini-2.0-flash-exp"
config = {
    "generation_config": {
        "response_modalities": ["AUDIO", "TEXT"],
        "temperature": 0.65
    }
}
async with client.aio.live.connect(model=MODEL, config=config) as session:
    message = "Hello? Gemini are you there?"
    print("> ", message, "\n")
    await session.send(message, end_of_turn=True)

    # For text responses, the loop ends when the model's turn is complete.
    turn = session.receive()
    async for chunk in turn:
        if chunk.text is not None:
            print(f'{chunk.text}', end="")
        else:
            print(chunk.text)
            print(chunk.server_content.model_turn.parts[0])
and get back:
ConnectionClosedError: received 1007 (invalid frame payload data) Request trace id: b7ac8bb69f1977ce, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: b7ac8bb69f1977ce, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
I am unsure what this means, but I assume it means this is not supported? If anyone from the Gemini team or anyone else can fill me in, it would be greatly appreciated.
Thanks for a great model!
Landon
Welcome to the forum.
The Live API works, but when you use the genai import, you are using the v1beta endpoint, and the Live API isn't there yet. To get started, check this cookbook - Google Colab
You will notice that it sets the URI to a v1alpha endpoint.
Hope that helps.
Doesn't this line use the alpha endpoint?
client = genai.Client(http_options={'api_version': 'v1alpha'})
My code will work if I just have response_modalities as ["AUDIO"]; my question is whether you can have both ["TEXT", "AUDIO"], similar to how OpenAI handles it.
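For reference, here is the difference as a minimal sketch (same client setup as in my first post):

# Works: a single response modality.
config = {
    "generation_config": {
        "response_modalities": ["AUDIO"],
        "temperature": 0.65
    }
}

# Fails for me with the 1007 error above: both modalities together.
config = {
    "generation_config": {
        "response_modalities": ["AUDIO", "TEXT"],
        "temperature": 0.65
    }
}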
Good point. There's more that changed in the URI; there is a /ws/ prefix (indicating the Live API is served over a websocket). I basically think the genai wrapper is not yet compatible with the endpoint, and I have to admit that's just a guess.
The demo code in the cookbook just uses the most direct programming approach possible, and that works.
Hope that helps.
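For illustration, here is a rough Python sketch of that direct websocket approach, modeled on the cookbook. Treat the endpoint path, the ?key= query parameter, and the message field names as assumptions taken from that notebook; they may change while the API is in v1alpha:

import asyncio
import json
import os

import websockets  # pip install websockets

HOST = "generativelanguage.googleapis.com"
API_KEY = os.environ["GOOGLE_API_KEY"]
# Assumed v1alpha websocket endpoint, as used in the cookbook.
URI = (f"wss://{HOST}/ws/google.ai.generativelanguage.v1alpha."
       f"GenerativeService.BidiGenerateContent?key={API_KEY}")

async def main():
    async with websockets.connect(URI) as ws:
        # The first message on the socket carries the model and config.
        setup = {
            "setup": {
                "model": "models/gemini-2.0-flash-exp",
                "generationConfig": {"responseModalities": ["AUDIO"]},
            }
        }
        await ws.send(json.dumps(setup))
        await ws.recv()  # wait for the setup acknowledgement

        # Send a single user turn and mark it complete.
        turn = {
            "clientContent": {
                "turns": [{"role": "user", "parts": [{"text": "Hello, Gemini"}]}],
                "turnComplete": True,
            }
        }
        await ws.send(json.dumps(turn))

        # Read server messages until the model signals the turn is done.
        async for raw in ws:
            msg = json.loads(raw)
            content = msg.get("serverContent", {})
            for part in content.get("modelTurn", {}).get("parts", []):
                print(part)  # text parts and/or base64-encoded audio chunks
            if content.get("turnComplete"):
                break

asyncio.run(main())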
I noticed that when I talk to Gemini 2.0 in the lab, it provides the text together with the audio. So the crash of the API interface is probably just a bug.
I just wonder how they make it work… Because in https://aistudio.google.com/live I get the text together with the synthesized voice data. Getting the phonemes, or at least the text, together with the audio response would be really helpful for syncing the voice with the lip movement of avatars, for instance from Ready Player Me, using Three.js.
I guess that's the question we both have.
I have the same doubt. Right now it works when I set
"response_modalities": ["TEXT"]
or
"response_modalities": ["AUDIO"]
but it doesn't work for
"response_modalities": ["AUDIO", "TEXT"]
Is there any possible way to get both AUDIO and TEXT responses back from the model?
I have the same question.
This is what I get from the API:
interface GenerativeContentBlob {
  mimeType: string;
  data: string;
}
So it seems the API does not return a transcript, only audio.
I am thinking of sending the audio to some speech-to-text API to get a transcript, but I don't like this idea. OpenAI's API just returns the transcript together with the audio.
Have you found a more elegant solution?
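In case it is useful, here is roughly what I mean by that workaround, sketched in Python rather than TypeScript. It is only a sketch under assumptions: the Live API is assumed to stream 16-bit mono PCM at 24 kHz, and a second, non-live generate_content call is just one way to do the transcription.

import wave

from google import genai
from google.genai import types

client = genai.Client()

def pcm_to_wav(pcm_bytes: bytes, path: str = "reply.wav") -> str:
    # Wrap the raw PCM chunks collected from the live session in a WAV header.
    # 16-bit mono at 24 kHz is an assumption about the Live API's output format.
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(pcm_bytes)
    return path

def transcribe(path: str) -> str:
    # Second, non-live request asking the model to transcribe its own audio reply.
    with open(path, "rb") as f:
        audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/wav")
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=["Transcribe this audio verbatim.", audio_part],
    )
    return response.text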