Need for Modality Recomposition: Access to TTS and STT API required

Hi. I’ve noticed that properly setting the personality of my test chat bot only works when I set the modality to “TEXT”, as in the config below. This, however, makes it impossible for me to obtain the Aoede voice output, because you don’t expose an API for generating the voice separately.

{
  model: "models/gemini-2.0-flash-exp",
  generationConfig: {
    responseModalities: "text",
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: "Aoede" } },
    },
  },
  systemInstruction: {
    parts: [
      {
        text: 'You are Google-chan, the agentic chatbot interface for the Google search engine. Your personality is the one of a young witty woman, with a sense of humor.',
      },
    ],
  },
  tools: [
    { googleSearch: {} },
    {
      functionDeclarations: [
        // system_shell_declaration
      ],
    },
  ],
}
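
For context, this is roughly how that config gets wrapped and sent. A minimal sketch, assuming the documented Live API WebSocket endpoint and setup-message shape, with the object above assigned to a `config` variable and `GEMINI_API_KEY` holding my key:

    // Sketch: open the bidirectional Live API socket and send the config as the setup message.
    const ws = new WebSocket(
      "wss://generativelanguage.googleapis.com/ws/google.ai.generativelanguage." +
        "v1alpha.GenerativeService.BidiGenerateContent?key=" + GEMINI_API_KEY
    );
    ws.addEventListener("open", () => {
      ws.send(JSON.stringify({ setup: config })); // config = the object above
    });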

In addition, my goal is to have an interactive avatar with lipsync that can also access a server shell for which I provide the credentials, and that can speak first. But as long as I can’t get the audio chunks plus some form of phoneme information, or simply the spoken text, I can’t derive real-time visemes for moving the lips while playing back the audio chunks.
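
To make the gap concrete, this is the kind of mapping I’d run on the transcript if I had it. It’s a rough sketch that maps characters straight to the Oculus/Ready Player Me viseme blendshape names instead of doing real phoneme analysis; the function and timing logic are made up for illustration:

    // Rough sketch: approximate visemes from text, no real grapheme-to-phoneme step.
    // Ready Player Me avatars expose the Oculus viseme blendshapes (viseme_aa, viseme_PP, ...).
    const CHAR_TO_VISEME = {
      a: "viseme_aa", e: "viseme_E", i: "viseme_I", o: "viseme_O", u: "viseme_U",
      p: "viseme_PP", b: "viseme_PP", m: "viseme_PP",
      f: "viseme_FF", v: "viseme_FF",
      s: "viseme_SS", z: "viseme_SS",
      d: "viseme_DD", t: "viseme_DD", n: "viseme_nn",
      k: "viseme_kk", g: "viseme_kk", r: "viseme_RR",
    };

    // Turn a spoken-text chunk into a timed viseme track for the audio chunk it belongs to.
    function textToVisemeTrack(text, audioDurationMs) {
      const letters = text.toLowerCase().replace(/[^a-z]/g, "");
      const step = letters.length ? audioDurationMs / letters.length : audioDurationMs;
      return [...letters].map((ch, i) => ({
        viseme: CHAR_TO_VISEME[ch] ?? "viseme_sil",
        timeMs: Math.round(i * step),
      }));
    }

It’s crude, but it would be enough for believable lip movement; the only thing missing is the text (or phonemes) per audio chunk.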

The idea, in the end, is to have an avatar like this pop up in the future which, based on my tracked search queries, remembers when the list of suggested links is related to a prior conversation and reminds me of that conversation’s context. In addition, it could make smart home interfaces more engaging.

Right now I can’t make Gemini 2.0 work with an avatar: the init prompt doesn’t stick when using the “AUDIO” modality, the audio chunks do not come with a string specifying the utterance, so I can’t convert them into a Ready Player Me viseme animation, and when using the “TEXT” modality I can’t easily convert the output back into a realistic voice.
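
The closest workaround I see is to stay on the “TEXT” modality and pipe each text chunk through a separate TTS service such as Google Cloud Text-to-Speech. A sketch of that path (the voice name is just an example, and of course it isn’t Aoede):

    // Workaround sketch: keep responseModalities: "text" and synthesize speech separately.
    import textToSpeech from "@google-cloud/text-to-speech";

    const ttsClient = new textToSpeech.TextToSpeechClient();

    // Returns raw PCM audio for one text chunk coming out of the TEXT-modality session.
    async function synthesizeChunk(text) {
      const [response] = await ttsClient.synthesizeSpeech({
        input: { text },
        voice: { languageCode: "en-US", name: "en-US-Neural2-F" }, // example voice, not Aoede
        audioConfig: { audioEncoding: "LINEAR16", sampleRateHertz: 24000 },
      });
      return response.audioContent; // play this while driving textToVisemeTrack(text, ...)
    }

That adds latency and loses the Aoede timbre, which is exactly why I’d rather have the Live API hand me the text alongside its own audio.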

Please make the API tell me what utterances the sounds in the audio chunks represent. Pretty please?

EDIT: One idea was to synthesize the init prompt as audio, maybe then the audio modality would properly pick up on it. But that shouldn’t happen… unless you really have TTS incorporated into the model itself and it spits out WAV file embeddings by itself… baking Suno into Gemini, yey, what could possibly go wrong… Then again, I noticed that when I test with https://aistudio.google.com/live it provides me the text alongside the audio output, so I guess it’s more an issue that the API doesn’t slap the substring on which the audio chunk is based on inside the “text=” field… this would be such an easy fix. The value is already there, just copy it over…
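
For completeness, this is what I’d do on the client the moment that field is populated. A sketch assuming the serverContent/modelTurn message shape of the Live API and the `ws` from the setup sketch above; `playPcmChunk`, `applyVisemes` and `estimateChunkDurationMs` are hypothetical helpers on my side:

    // Sketch of the consumer I'd write if each audio chunk arrived with its text.
    ws.addEventListener("message", async (event) => {
      const raw = typeof event.data === "string" ? event.data : await event.data.text();
      const msg = JSON.parse(raw);
      const parts = msg.serverContent?.modelTurn?.parts ?? [];
      for (const part of parts) {
        if (part.inlineData?.mimeType?.startsWith("audio/pcm")) {
          playPcmChunk(part.inlineData.data); // base64 PCM from the model
        }
        if (part.text) {
          // This is the field I'm asking for: drive the avatar's mouth with it.
          applyVisemes(textToVisemeTrack(part.text, estimateChunkDurationMs(part)));
        }
      }
    });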