How to add a real-time AI avatar (with lip-sync) to a Google AI Studio streaming chatbot?

Hi everyone,

I’ve built a real-time voice chatbot project, and it’s working well technically. I’m now trying to add a “face” to my assistant—a 2D or 3D character that speaks the responses with accurate, real-time lip-syncing.

I’m running into a wall, as all the AI avatar services I’ve found (like Google Vids, HeyGen, etc.) are not real-time. They are asynchronous generators that take a full script and render a video file, which doesn’t work for a live, streaming conversation.

How can I achieve real-time avatar lip-sync with my current Google Cloud stack?

My Current Tech Stack

  • Frontend: index.html with chat.js.

  • Audio Capture (User): Web Audio API (AudioContext, MediaStream) capturing 16kHz PCM audio.

  • Backend: A Node.js (ws-proxy-simple.js) server acting as a WebSocket proxy.

  • AI Service: I’m streaming audio directly to the Gemini 2.5 Flash Native Audio API (the BidiGenerateContent endpoint) via a WebSocket.

  • Audio Playback (AI): The proxy streams back the AI’s audio, which I play in the browser using the Web Audio API (AudioContext.decodeAudioData).

The entire loop is real-time and streaming, so the avatar solution must also be real-time.
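One practical detail of the playback path above: `decodeAudioData` expects containerized audio (WAV, MP3, etc.), so if the stream arrives as headerless PCM chunks, a common alternative is converting the 16-bit samples to Float32 by hand before handing them to the Web Audio API. A minimal sketch, assuming mono, 16-bit little-endian PCM:

```javascript
// Convert a chunk of 16-bit PCM (as received over the WebSocket) into the
// Float32 samples the Web Audio API expects. Assumes mono, little-endian
// samples on a little-endian platform; sample rate is handled separately
// when the AudioBuffer is created.
function pcm16ToFloat32(arrayBuffer) {
  const int16 = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) {
    // Scale from [-32768, 32767] to [-1, 1].
    float32[i] = int16[i] / 32768;
  }
  return float32;
}

// Usage sketch: copy the samples into an AudioBuffer and schedule it.
// const samples = pcm16ToFloat32(chunk);
// const buffer = audioContext.createBuffer(1, samples.length, 24000);
// buffer.copyToChannel(samples, 0);
```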

The Core Problem

How do I get the “lip shape” (viseme) data to animate my avatar? As I see it, I have two options, but I’m not sure how to proceed with either.

Option 1: Does Google provide viseme data? Does the Gemini streaming API or the Vertex AI Text-to-Speech API have an option to send back a JSON stream of viseme (lip-sync) data alongside the audio stream? I’ve seen that some other cloud providers offer this, but I can’t find it in the Google documentation. This would be the ideal solution.
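For comparison, providers that do offer this typically stream viseme events as (id, timestamp) pairs alongside the audio, and the client looks up the active viseme against the playback clock. A minimal sketch of that lookup, using a hypothetical `{visemeId, timeMs}` event shape (not a Gemini API):

```javascript
// Given a list of viseme events sorted by time and the current playback
// position, return the viseme that should be shown right now.
// The event shape {visemeId, timeMs} is illustrative, not any real API.
function activeViseme(events, playbackMs) {
  let current = null;
  for (const e of events) {
    if (e.timeMs <= playbackMs) current = e.visemeId;
    else break; // events are assumed sorted by timeMs
  }
  return current;
}
```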

Option 2: The “DIY” real-time audio analysis? If Google doesn’t provide viseme data, I assume I must analyze the AI’s audio stream in real time in the browser. My plan would be:

  1. Get a 3D Model: Use a service like Ready Player Me to get a free avatar with facial “blend shapes.”

  2. Render in Browser: Use Three.js to load and display this model in my index.html.

  3. Analyze Audio: Take the AI’s audio stream from my AudioContext and connect it to an AnalyserNode.

  4. Map Audio to Lip-Sync: Use a library (like the open-source Wawa-Lipsync?) to map the frequency data from the AnalyserNode to the 3D model’s blend shapes.

Has anyone successfully done this? It seems complex, and I’m worried about performance and accuracy.
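Step 4 above can be sketched as a pure mapping from `AnalyserNode` frequency data to a single “jaw open” blend-shape weight. The bin count, normalization, and smoothing factor here are illustrative assumptions, not tuned values:

```javascript
// Rough amplitude-to-viseme mapping: average the low/mid frequency bins
// (where speech energy concentrates) and normalize to a 0..1 "jaw open"
// value for the avatar's blend shape.
function mouthOpenFromFrequencyData(bins, prev = 0, smoothing = 0.6) {
  // getByteFrequencyData fills a Uint8Array with 0..255 magnitudes.
  const speechBins = bins.subarray(0, Math.min(32, bins.length));
  let sum = 0;
  for (let i = 0; i < speechBins.length; i++) sum += speechBins[i];
  const target = Math.min(1, sum / speechBins.length / 128);
  // Exponential smoothing avoids jittery mouth-flapping between frames.
  return prev * smoothing + target * (1 - smoothing);
}

// Per-frame usage sketch (names like "jawOpen" depend on the model):
// analyser.getByteFrequencyData(bins);
// mouth = mouthOpenFromFrequencyData(bins, mouth);
// mesh.morphTargetInfluences[mesh.morphTargetDictionary["jawOpen"]] = mouth;
```

This only gives a mouth-open amplitude, not distinct visemes; libraries like wawa-lipsync layer a more detailed frequency-band classification on top of the same idea.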

My Questions

  1. What is the “Google” way to do this? Is there an official Google Cloud service or partner SDK (like Soul Machines) that is built for this exact real-time use case?

  2. If I have to build it myself (Option 2), am I on the right track? Are there better libraries or tutorials for connecting a real-time audio stream to a Three.js avatar’s face?

  3. What other Google Cloud features could I use? Since I’m already on the platform, what other “wow” features could I add to my chatbot? I’ve seen Imagen (image generation) and Veo (video generation), but are there other cool APIs that work well with a chatbot?

I’m open to any suggestions, from official paid SDKs to open-source libraries.

Thanks in advance!


Hi @Mehedi_Hasan_Shihab
I will try to address your questions concerning the Gemini API and Gemini models.
Regarding Option 1, the Gemini Multimodal Live API currently streams PCM audio and text but does not provide viseme or blendshape metadata for animation.

Regarding Option 2 (Three.js, Ready Player Me, or external lip-sync libraries), we don’t have much info regarding that.

You can also implement Tool Use (Function Calling) within the Live API to enable your model to take actions based on the conversation.
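For example, tools are registered in the Live API’s initial setup message. A sketch of that message is below; the `get_weather` tool and its schema are made-up examples, while the overall `setup` / `tools` / `functionDeclarations` shape follows the documented API:

```javascript
// Build the first message sent over the BidiGenerateContent WebSocket,
// registering a function the model may call mid-conversation.
// The tool name and parameter schema are illustrative only.
function buildSetupMessage(model) {
  return {
    setup: {
      model: model,
      tools: [
        {
          functionDeclarations: [
            {
              name: "get_weather", // hypothetical tool
              description: "Look up the current weather for a city.",
              parameters: {
                type: "object",
                properties: { city: { type: "string" } },
                required: ["city"],
              },
            },
          ],
        },
      ],
    },
  };
}

// ws.send(JSON.stringify(buildSetupMessage("models/<your-live-model>")));
```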
If you have any questions related to Gemini models, please let us know.
Thanks

Greetings! Sorry to jump in, but I was hoping you might lend your expertise on building something like what my username suggests: animating a cutout from a 2D image, essentially bringing a person to life through real-time animation. I’m also curious whether your proposed solution involves using a video iframe.

Hi @Pannaga_J,
Thank you for the clarification regarding the current limitations of the Gemini Multimodal Live API.

I understand that at the moment the Live API streams PCM audio and text only, and does not expose viseme or blendshape metadata for real-time avatar animation.

I wanted to ask if there are any alternative or upcoming Google-supported technologies that could help address this use case, for example:

  1. Any roadmap plans to expose viseme / phoneme / blendshape metadata in future Gemini Live or native audio models?

  2. Availability of other Google APIs or services (current or experimental) that can assist with real-time lip-sync or facial animation when paired with Gemini audio output.

  3. Recommended best-practice architecture for achieving real-time avatars using Gemini (e.g., integrating Gemini audio with external speech-to-viseme or animation pipelines).

  4. Any internal examples, demos, or reference implementations that showcase real-time conversational avatars using Google AI tooling.

Even high-level guidance on supported or recommended integrations would be very helpful, as this is a common requirement for real-time AI avatar experiences.

Thanks again for your response and support.
Looking forward to your insights.

Best regards,
Mehedi Hasan Shihab

Hi @Nouman_Javaid1,
Thank you for reaching out and for explaining your use case so clearly.

This is an interesting area, and we’d like to review it in more detail. We’ll discuss your query internally with our team and evaluate the feasibility and possible approaches. Once we’ve aligned internally, we’ll get back to you with our decision and any relevant updates.

Thanks for your patience, and we appreciate your interest.

Best regards,
Mehedi Hasan Shihab