How to add a real-time AI avatar (with lip-sync) to a Google AI Studio streaming chatbot?

Hi everyone,

I’ve built a real-time voice chatbot project, and it’s working well technically. I’m now trying to add a “face” to my assistant—a 2D or 3D character that speaks the responses with accurate, real-time lip-syncing.

I’m running into a wall, as all the AI avatar services I’ve found (like Google Vids, HeyGen, etc.) are not real-time. They are asynchronous generators that take a full script and render a video file, which doesn’t work for a live, streaming conversation.

How can I achieve real-time avatar lip-sync with my current Google Cloud stack?

My Current Tech Stack

  • Frontend: index.html with chat.js.

  • Audio Capture (User): Web Audio API (AudioContext, MediaStream) capturing 16kHz PCM audio.

  • Backend: A Node.js (ws-proxy-simple.js) server acting as a WebSocket proxy.

  • AI Service: I’m streaming audio directly to the Gemini 2.5 Flash Native Audio API (the BidiGenerateContent endpoint) via a WebSocket.

  • Audio Playback (AI): The proxy streams back the AI’s audio, which I play in the browser using the Web Audio API (AudioContext.decodeAudioData).

The entire loop is real-time and streaming, so the avatar solution must also be real-time.
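One practical detail of the playback path above: `decodeAudioData` expects containerized audio (WAV, MP3, etc.), so if the stream arrives as headerless PCM chunks, a common alternative is converting the 16-bit samples to Float32 by hand before handing them to the Web Audio API. A minimal sketch, assuming mono, 16-bit little-endian PCM:

```javascript
// Convert a chunk of 16-bit PCM (as received over the WebSocket) into the
// Float32 samples the Web Audio API expects. Assumes mono, little-endian
// samples on a little-endian platform; sample rate is handled separately
// when the AudioBuffer is created.
function pcm16ToFloat32(arrayBuffer) {
  const int16 = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    // Scale from [-32768, 32767] to [-1, 1].
    float32[i] = int16[i] / 32768;
  }
  return float32;
}

// Usage sketch: copy the samples into an AudioBuffer and schedule it.
// const samples = pcm16ToFloat32(chunk);
// const buffer = audioContext.createBuffer(1, samples.length, 24000);
// buffer.copyToChannel(samples, 0);
```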

The Core Problem

How do I get the “lip shape” (viseme) data to animate my avatar? As I see it, I have two options, but I’m not sure how to proceed with either.

Option 1: Does Google provide viseme data? Does the Gemini streaming API or the Vertex AI Text-to-Speech API have an option to send back a JSON stream of viseme (lip-sync) data alongside the audio stream? I’ve seen that some other cloud providers offer this, but I can’t find it in the Google documentation. This would be the ideal solution.
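For comparison, providers that do offer this typically stream viseme events as (id, timestamp) pairs alongside the audio, and the client looks up the active viseme against the playback clock. A minimal sketch of that lookup, using a hypothetical `{visemeId, timeMs}` event shape (not a Gemini API):

```javascript
// Given a list of viseme events sorted by time and the current playback
// position, return the viseme that should be shown right now.
// The event shape {visemeId, timeMs} is illustrative, not any real API.
function activeViseme(events, playbackMs) {
  let current = null;
  for (const e of events) {
    if (e.timeMs <= playbackMs) current = e.visemeId;
    else break; // events are assumed sorted by timeMs
  }
  return current;
}
```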

Option 2: The “DIY” real-time audio analysis? If Google doesn’t provide viseme data, I assume I must analyze the AI’s audio stream in real time in the browser. My plan would be:

  1. Get a 3D Model: Use a service like Ready Player Me to get a free avatar with facial “blend shapes.”

  2. Render in Browser: Use Three.js to load and display this model in my index.html.

  3. Analyze Audio: Take the AI’s audio stream from my AudioContext and connect it to an AnalyserNode.

  4. Map Audio to Lip-Sync: Use a library (like the open-source Wawa-Lipsync?) to map the frequency data from the AnalyserNode to the 3D model’s blend shapes.

Has anyone successfully done this? It seems complex, and I’m worried about performance and accuracy.
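Step 4 above can be sketched as a pure mapping from `AnalyserNode` frequency data to a single “jaw open” blend-shape weight. The bin count, normalization, and smoothing factor here are illustrative assumptions, not tuned values:

```javascript
// Rough amplitude-to-viseme mapping: average the low/mid frequency bins
// (where speech energy concentrates) and normalize to a 0..1 "jaw open"
// value for the avatar's blend shape.
function mouthOpenFromFrequencyData(bins, prev = 0, smoothing = 0.6) {
  // getByteFrequencyData fills a Uint8Array with 0..255 magnitudes.
  const speechBins = bins.subarray(0, Math.min(32, bins.length));
  let sum = 0;
  for (let i = 0; i < speechBins.length; i++) sum += speechBins[i];
  const target = Math.min(1, sum / speechBins.length / 128);
  // Exponential smoothing avoids jittery mouth-flapping between frames.
  return prev * smoothing + target * (1 - smoothing);
}

// Per-frame usage sketch (names like "jawOpen" depend on the model):
// analyser.getByteFrequencyData(bins);
// mouth = mouthOpenFromFrequencyData(bins, mouth);
// mesh.morphTargetInfluences[mesh.morphTargetDictionary["jawOpen"]] = mouth;
```

This only gives a mouth-open amplitude, not distinct visemes; libraries like wawa-lipsync layer a more detailed frequency-band classification on top of the same idea.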

My Questions

  1. What is the “Google” way to do this? Is there an official Google Cloud service or partner SDK (like Soul Machines) that is built for this exact real-time use case?

  2. If I have to build it myself (Option 2), am I on the right track? Are there better libraries or tutorials for connecting a real-time audio stream to a Three.js avatar’s face?

  3. What other Google Cloud features could I use? Since I’m already on the platform, what other “wow” features could I add to my chatbot? I’ve seen Imagen (image generation) and Veo (video generation), but are there other cool APIs that work well with a chatbot?

I’m open to any suggestions, from official paid SDKs to open-source libraries.

Thanks in advance!


Hi @Mehedi_Hasan_Shihab
I will try to address your questions concerning the Gemini API and Gemini models.
Regarding Option 1, the Gemini Multimodal Live API currently streams PCM audio and text but does not provide viseme or blendshape metadata for animation.

Regarding Option 2 (Three.js, Ready Player Me, or external lip-sync libraries), we don’t have much info regarding that.

You can also implement Tool Use (Function Calling) within the Live API to enable your model to take actions based on the conversation.
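For example, tools are registered in the Live API’s initial setup message. A sketch of that message is below; the `get_weather` tool and its schema are made-up examples, while the overall `setup` / `tools` / `functionDeclarations` shape follows the documented API:

```javascript
// Build the first message sent over the BidiGenerateContent WebSocket,
// registering a function the model may call mid-conversation.
// The tool name and parameter schema are illustrative only.
function buildSetupMessage(model) {
  return {
    setup: {
      model: model,
      tools: [
        {
          functionDeclarations: [
            {
              name: "get_weather", // hypothetical tool
              description: "Look up the current weather for a city.",
              parameters: {
                type: "object",
                properties: { city: { type: "string" } },
                required: ["city"],
              },
            },
          ],
        },
      ],
    },
  };
}

// ws.send(JSON.stringify(buildSetupMessage("models/<your-live-model>")));
```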
If you have any questions related to Gemini models, please let us know.
Thanks

Greetings! Sorry to jump in, but I was hoping you might lend your expertise on building something like what my username suggests: animating a cutout from a 2D image, essentially bringing a person to life through real-time animation. I’m also curious whether your proposed solution involves using a video iframe.

Hi @Pannaga_J,
Thank you for the clarification regarding the current limitations of the Gemini Multimodal Live API.

I understand that at the moment the Live API streams PCM audio and text only, and does not expose viseme or blendshape metadata for real-time avatar animation.

I wanted to ask if there are any alternative or upcoming Google-supported technologies that could help address this use case, for example:

  1. Any roadmap plans to expose viseme / phoneme / blendshape metadata in future Gemini Live or native audio models?

  2. Availability of other Google APIs or services (current or experimental) that can assist with real-time lip-sync or facial animation when paired with Gemini audio output.

  3. Recommended best-practice architecture for achieving real-time avatars using Gemini (e.g., integrating Gemini audio with external speech-to-viseme or animation pipelines).

  4. Any internal examples, demos, or reference implementations that showcase real-time conversational avatars using Google AI tooling.

Even high-level guidance on supported or recommended integrations would be very helpful, as this is a common requirement for real-time AI avatar experiences.

Thanks again for your response and support.
Looking forward to your insights.

Best regards,
Mehedi Hasan Shihab

Hi @Nouman_Javaid1,
Thank you for reaching out and for explaining your use case so clearly.

This is an interesting area, and we’d like to review it in more detail. We’ll discuss your query internally with our team and evaluate the feasibility and possible approaches. Once we’ve aligned internally, we’ll get back to you with our decision and any relevant updates.

Thanks for your patience, and we appreciate your interest.

Best regards,
Mehedi Hasan Shihab