Hi everyone,
I want to use the new realtime (Live) API on the Gemini API, but I am unable to receive text AND audio, similar to how it works in Google AI Studio. Is this not possible at the moment? Currently, I am using this code:
import asyncio
import base64
import contextlib
import datetime
import itertools
import json
import os
import wave

from IPython.display import display, Audio
from google import genai
from google.genai import types

os.environ['GOOGLE_API_KEY'] = "..."

client = genai.Client(http_options={'api_version': 'v1alpha'})
MODEL = "gemini-2.0-flash-exp"
config = {
    "generation_config": {
        "response_modalities": ["AUDIO", "TEXT"],
        "temperature": 0.65
    }
}
async with client.aio.live.connect(model=MODEL, config=config) as session:
    message = "Hello? Gemini are you there?"
    print("> ", message, "\n")
    await session.send(message, end_of_turn=True)

    # For text responses, the loop ends when the model's turn is complete.
    turn = session.receive()
    async for chunk in turn:
        if chunk.text is not None:
            print(f'{chunk.text}', end="")
        else:
            print(chunk.text)
            print(chunk.server_content.model_turn.parts[0])
and get back:
ConnectionClosedError: received 1007 (invalid frame payload data) Request trace id: b7ac8bb69f1977ce, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: b7ac8bb69f1977ce, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
I am unsure what this means, but I assume it means this is not supported? If anyone from the Gemini team or anyone else can fill me in, it would be greatly appreciated.
Thanks for a great model!
Landon
Welcome to the forum.
The Live API works, but when you use the genai import, you are using the v1beta endpoint, and the Live API isn't there yet. To get started, check this cookbook - Google Colab
You will notice that it sets the URI to a v1alpha endpoint.
Hope that helps.
Doesn't this line use the alpha endpoint?
client = genai.Client(http_options={'api_version': 'v1alpha'})
My code will work if I just have response_modalities as ["AUDIO"]; my question is whether you can have both ["TEXT", "AUDIO"], similar to how OpenAI handles it.
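For reference, here is the difference as a minimal sketch (same client setup as in my first post):

# Works: a single response modality.
config = {
    "generation_config": {
        "response_modalities": ["AUDIO"],
        "temperature": 0.65
    }
}

# Fails for me with the 1007 error above: both modalities together.
config = {
    "generation_config": {
        "response_modalities": ["AUDIO", "TEXT"],
        "temperature": 0.65
    }
}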
Good point. There's more that changed in the URI; there is a /ws/ prefix (indicating the Live API is served over a websocket). I basically think the genai wrapper is not yet compatible with the endpoint, and I have to admit that's just a guess.
The demo code in the cookbook just uses the most direct programming approach possible, and that works.
Hope that helps.
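For illustration, here is a rough Python sketch of that direct websocket approach, modeled on the cookbook. Treat the endpoint path, the ?key= query parameter, and the message field names as assumptions taken from that notebook; they may change while the API is in v1alpha:

import asyncio
import json
import os

import websockets  # pip install websockets

HOST = "generativelanguage.googleapis.com"
API_KEY = os.environ["GOOGLE_API_KEY"]
# Assumed v1alpha websocket endpoint, as used in the cookbook.
URI = (f"wss://{HOST}/ws/google.ai.generativelanguage.v1alpha."
       f"GenerativeService.BidiGenerateContent?key={API_KEY}")

async def main():
    async with websockets.connect(URI) as ws:
        # The first message on the socket carries the model and config.
        setup = {
            "setup": {
                "model": "models/gemini-2.0-flash-exp",
                "generationConfig": {"responseModalities": ["AUDIO"]},
            }
        }
        await ws.send(json.dumps(setup))
        await ws.recv()  # wait for the setup acknowledgement

        # Send a single user turn and mark it complete.
        turn = {
            "clientContent": {
                "turns": [{"role": "user", "parts": [{"text": "Hello, Gemini"}]}],
                "turnComplete": True,
            }
        }
        await ws.send(json.dumps(turn))

        # Read server messages until the model signals the turn is done.
        async for raw in ws:
            msg = json.loads(raw)
            content = msg.get("serverContent", {})
            for part in content.get("modelTurn", {}).get("parts", []):
                print(part)  # text parts and/or base64-encoded audio chunks
            if content.get("turnComplete"):
                break

asyncio.run(main())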
I noticed that when I talk to Gemini 2.0 in the lab, it provides the text together with the audio. So the crash of the API interface is probably just a bug.
I just wonder how they make it work… Because in https://aistudio.google.com/live I get the text together with the synthesized voice data. Getting the phonemes, or at least the text, together with the audio response would be really helpful for syncing the voice with the lip movement of avatars, for instance from Ready Player Me, using Three.js.
I guess that's the question we both have.
I have the same doubt. Right now it works when I set
"response_modalities": ["TEXT"]
or
"response_modalities": ["AUDIO"]
but it doesn't work for
"response_modalities": ["AUDIO", "TEXT"]
Is there any possible way to get both AUDIO and TEXT responses back from the model?
I have the same question.
This is what I get from the API:
interface GenerativeContentBlob {
  mimeType: string;
  data: string;
}
So it seems the API does not return a transcript, only audio.
I am thinking of sending the audio to some speech-to-text API to get a transcript, but I don't like this idea. OpenAI's API just returns the transcript together with the audio.
Have you found a more elegant solution?
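In case it is useful, here is roughly what I mean by that workaround, sketched in Python rather than TypeScript. It is only a sketch under assumptions: the Live API is assumed to stream 16-bit mono PCM at 24 kHz, and a second, non-live generate_content call is just one way to do the transcription.

import wave

from google import genai
from google.genai import types

client = genai.Client()

def pcm_to_wav(pcm_bytes: bytes, path: str = "reply.wav") -> str:
    # Wrap the raw PCM chunks collected from the live session in a WAV header.
    # 16-bit mono at 24 kHz is an assumption about the Live API's output format.
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(24000)
        wf.writeframes(pcm_bytes)
    return path

def transcribe(path: str) -> str:
    # Second, non-live request asking the model to transcribe its own audio reply.
    with open(path, "rb") as f:
        audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/wav")
    response = client.models.generate_content(
        model="gemini-2.0-flash-exp",
        contents=["Transcribe this audio verbatim.", audio_part],
    )
    return response.text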