How to add a real-time AI avatar (with lip-sync) to a Google AI Studio streaming chatbot?

Hi everyone,

I’ve built a real-time voice chatbot project, and it’s working well technically. I’m now trying to add a “face” to my assistant—a 2D or 3D character that speaks the responses with accurate, real-time lip-syncing.

I’m running into a wall, as all the AI avatar services I’ve found (Google Vids, HeyGen, etc.) are not real-time. They are asynchronous generators that take a full script and render a video file, which doesn’t work for a live, streaming conversation.

How can I achieve real-time avatar lip-sync with my current Google Cloud stack?

My Current Tech Stack

  • Frontend: index.html with chat.js.

  • Audio Capture (User): Web Audio API (AudioContext, MediaStream) capturing 16kHz PCM audio.

  • Backend: A Node.js (ws-proxy-simple.js) server acting as a WebSocket proxy.

  • AI Service: I’m streaming audio directly to the Gemini 2.5 Flash Native Audio API (the BidiGenerateContent endpoint) via a WebSocket.

  • Audio Playback (AI): The proxy streams back the AI’s audio, which I play in the browser using the Web Audio API (AudioContext.decodeAudioData).

The entire loop is real-time and streaming, so the avatar solution must also be real-time.
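
For context, here is a simplified sketch of the playback side of chat.js. It skips the decodeAudioData path I mentioned above and converts each chunk directly, assuming the proxy forwards raw 16-bit PCM at 24 kHz (adjust if your stream differs); the point is just that every AI audio chunk flows through an AudioContext graph that an avatar solution would need to tap.

```javascript
// Simplified playback path in chat.js.
// Assumption: the proxy forwards raw 16-bit little-endian PCM at 24 kHz.
const playbackCtx = new AudioContext({ sampleRate: 24000 });
let nextStartTime = 0;

// Called for every binary audio chunk received from the WebSocket proxy.
function playPcmChunk(arrayBuffer) {
  const int16 = new Int16Array(arrayBuffer);
  if (int16.length === 0) return;

  // Convert 16-bit integer samples to the [-1, 1] floats Web Audio expects.
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    float32[i] = int16[i] / 32768;
  }

  const buffer = playbackCtx.createBuffer(1, float32.length, playbackCtx.sampleRate);
  buffer.copyToChannel(float32, 0);

  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);

  // Schedule chunks back to back so playback stays gapless even though
  // chunks arrive from the proxy at irregular intervals.
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```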

The Core Problem

How do I get the “lip shape” (viseme) data to animate my avatar? As I see it, I have two options, but I’m not sure how to proceed with either.

Option 1: Does Google provide viseme data? Does the Gemini streaming API or the Vertex AI Text-to-Speech API have an option to send back a JSON stream of viseme (lip-sync) data alongside the audio stream? I’ve seen that some other cloud providers offer this, but I can’t find it in the Google documentation. This would be the ideal solution.

Option 2: The “DIY” route, with real-time audio analysis? If Google doesn’t provide viseme data, I assume I must analyze the AI’s audio stream in real time in the browser. My plan would be:

  1. Get a 3D Model: Use a service like Ready Player Me to get a free avatar with facial “blend shapes.”

  2. Render in Browser: Use Three.js to load and display this model in my index.html.

  3. Analyze Audio: Take the AI’s audio stream from my AudioContext and connect it to an AnalyserNode.

  4. Map Audio to Lip-Sync: Use a library (like the open-source Wawa-Lipsync?) to map the frequency data from the AnalyserNode to the 3D model’s blend shapes.

Has anyone successfully done this? It seems complex, and I’m worried about the performance and accuracy.
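
To make Option 2 concrete, here is a rough, untested sketch of steps 2-4: a Three.js scene that loads a GLB avatar and drives a single mouth blend shape from an AnalyserNode. The avatar path and the “mouthOpen” morph-target name are placeholders that depend on the specific model, and a real viseme mapping for step 4 would be considerably more involved.

```javascript
// Rough sketch of steps 2-4. Assumes a Ready Player Me-style GLB whose head
// mesh exposes an ARKit-like "mouthOpen" morph target; the file path and
// blend-shape name are placeholders.
import * as THREE from 'three';
import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(30, innerWidth / innerHeight, 0.1, 100);
camera.position.set(0, 1.6, 1);
const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(innerWidth, innerHeight);
document.body.appendChild(renderer.domElement);
scene.add(new THREE.AmbientLight(0xffffff, 1));

// In practice, reuse the AudioContext that already plays the AI's audio and
// connect each playback source node to the analyser as well as to the
// destination; a fresh context is created here only to keep the sketch
// self-contained.
const audioCtx = new AudioContext();
const analyser = audioCtx.createAnalyser();
analyser.fftSize = 1024;
const freqData = new Uint8Array(analyser.frequencyBinCount);

let headMesh;
new GLTFLoader().load('/avatar.glb', (gltf) => {
  gltf.scene.traverse((obj) => {
    if (obj.isMesh && obj.morphTargetDictionary?.mouthOpen !== undefined) {
      headMesh = obj; // the mesh that owns the mouth blend shape
    }
  });
  scene.add(gltf.scene);
});

function animate() {
  requestAnimationFrame(animate);
  if (headMesh) {
    analyser.getByteFrequencyData(freqData);
    // Crude "loudness = mouth openness" mapping; a viseme library such as
    // wawa-lipsync would replace this with per-phoneme blend shapes.
    const level = freqData.reduce((sum, v) => sum + v, 0) / freqData.length / 255;
    const idx = headMesh.morphTargetDictionary.mouthOpen;
    headMesh.morphTargetInfluences[idx] = THREE.MathUtils.lerp(
      headMesh.morphTargetInfluences[idx],
      Math.min(level * 3, 1),
      0.5
    );
  }
  renderer.render(scene, camera);
}
animate();
```

Even this crude loudness-to-mouth-openness mapping should show whether the end-to-end latency feels acceptable before I commit to a proper viseme library.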

My Questions

  1. What is the “Google” way to do this? Is there an official Google Cloud service or partner SDK (like Soul Machines) built for this exact real-time use case?

  2. If I have to build it myself (Option 2), am I on the right track? Are there better libraries or tutorials for connecting a real-time audio stream to a Three.js avatar’s face?

  3. What other Google Cloud features could I use? Since I’m already on the platform, what other “wow” features could I add to my chatbot? I’ve seen Imagen (image generation) and Veo (video generation), but are there other cool APIs that work well with a chatbot?

I’m open to any suggestions, from official paid SDKs to open-source libraries.

Thanks in advance!

Hi @Mehedi_Hasan_Shihab,

I will try to address your questions concerning the Gemini API and Gemini models.

Regarding Option 1: the Gemini Multimodal Live API currently streams PCM audio and text, but it does not provide viseme or blendshape metadata for animation.

Regarding Option 2 (Three.js, Ready Player Me, or external lip-sync libraries): those are third-party tools, so we don’t have much information to share about them.

You can also implement Tool Use (Function Calling) within the Live API to enable your model to take actions based on the conversation.
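As a rough illustration only (the model ID and the function below are placeholders, and you should confirm the exact field names against the current Live API documentation), function declarations are included in the setup message your proxy sends when it opens the BidiGenerateContent WebSocket:

```javascript
// Illustrative sketch only: the model ID and function are placeholders, and the
// setup-message schema should be checked against the current Live API docs.
const setupMessage = {
  setup: {
    model: 'models/your-live-model-id', // e.g. a Gemini 2.5 Flash native-audio model
    tools: [
      {
        functionDeclarations: [
          {
            name: 'get_weather',
            description: 'Look up the current weather for a city.',
            parameters: {
              type: 'OBJECT',
              properties: {
                city: { type: 'STRING' },
              },
              required: ['city'],
            },
          },
        ],
      },
    ],
  },
};

// Send this as the first message on the WebSocket your proxy opens to the
// BidiGenerateContent endpoint; when the model decides to call the function it
// emits a toolCall message, which your proxy answers with a toolResponse.
```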
If you have any further questions related to Gemini models, please let us know.
Thanks