Hi everyone,
I’ve built a real-time voice chatbot project, and it’s working well technically. I’m now trying to add a “face” to my assistant—a 2D or 3D character that speaks the responses with accurate, real-time lip-syncing.
I’m running into a wall, as all the AI avatar services I’ve found (like Google Vids, HeyGen, etc.) are not real-time. They are asynchronous generators that take a full script and render a video file, which doesn’t work for a live, streaming conversation.
How can I achieve real-time avatar lip-sync with my current Google Cloud stack?
My Current Tech Stack
- Frontend: `index.html` with `chat.js`.
- Audio Capture (User): Web Audio API (`AudioContext`, `MediaStream`) capturing 16kHz PCM audio.
- Backend: A Node.js server (`ws-proxy-simple.js`) acting as a WebSocket proxy.
- AI Service: I’m streaming audio directly to the Gemini 2.5 Flash Native Audio API (the `BidiGenerateContent` endpoint) via a WebSocket.
- Audio Playback (AI): The proxy streams back the AI’s audio, which I play in the browser using the Web Audio API (`AudioContext.decodeAudioData`).
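For context, the playback side boils down to something like the sketch below (simplified and illustrative: it assumes the proxy forwards raw 16-bit PCM chunks, and names like `playbackCtx`, `SAMPLE_RATE`, and `playPcmChunk` are placeholders rather than my actual code):

```js
// Illustrative playback sketch: convert incoming 16-bit PCM chunks to
// AudioBuffers and schedule them back to back so playback stays gapless.
const SAMPLE_RATE = 24000; // whatever rate the proxy actually delivers
const playbackCtx = new AudioContext({ sampleRate: SAMPLE_RATE });
let nextStartTime = 0;

function playPcmChunk(arrayBuffer) {
  // Int16 PCM -> Float32 samples in [-1, 1]
  const int16 = new Int16Array(arrayBuffer);
  const float32 = new Float32Array(int16.length);
  for (let i = 0; i < int16.length; i++) float32[i] = int16[i] / 32768;

  const buffer = playbackCtx.createBuffer(1, float32.length, SAMPLE_RATE);
  buffer.copyToChannel(float32, 0);

  const source = playbackCtx.createBufferSource();
  source.buffer = buffer;
  source.connect(playbackCtx.destination);

  // Queue chunks sequentially on the AudioContext clock
  nextStartTime = Math.max(nextStartTime, playbackCtx.currentTime);
  source.start(nextStartTime);
  nextStartTime += buffer.duration;
}
```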
The entire loop is real-time and streaming, so the avatar solution must also be real-time.
The Core Problem
How do I get the “lip shape” (viseme) data to animate my avatar? As I see it, I have two options, but I’m not sure how to proceed with either.
Option 1: Does Google provide viseme data? Does the Gemini streaming API or the Vertex AI Text-to-Speech API have an option to send back a JSON stream of viseme (lip-sync) data alongside the audio stream? I’ve seen that some other cloud providers offer this, but I can’t find it in the Google documentation. This would be the ideal solution.
Option 2: The “DIY” real-time audio analysis? If Google doesn’t provide viseme data, I assume I must analyze the AI’s audio stream in real time in the browser. My plan (roughly sketched in code after this list) would be:
- Get a 3D Model: Use a service like Ready Player Me to get a free avatar with facial “blend shapes.”
- Render in Browser: Use `Three.js` to load and display this model in my `index.html`.
- Analyze Audio: Take the AI’s audio stream from my `AudioContext` and connect it to an `AnalyserNode`.
- Map Audio to Lip-Sync: Use a library (like the open-source Wawa-Lipsync?) to map the frequency data from the `AnalyserNode` to the 3D model’s blend shapes.
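To make Option 2 concrete, here is the kind of rough sketch I have in mind, building on the playback sketch above (not working code: the blend shape name "jawOpen" depends on the avatar export, and a real implementation would map frequency bands to proper visemes, e.g. via Wawa-Lipsync, rather than using this crude amplitude heuristic):

```js
// Rough idea: tap the playback graph with an AnalyserNode and drive a single
// "mouth open" blend shape from overall signal energy, smoothed per frame.
const analyser = playbackCtx.createAnalyser();
analyser.fftSize = 1024;
analyser.connect(playbackCtx.destination);
const freqData = new Uint8Array(analyser.frequencyBinCount);

// Inside playPcmChunk(), each source would connect to the analyser instead of
// straight to the destination:  source.connect(analyser);

function animateMouth(avatarMesh) {
  requestAnimationFrame(() => animateMouth(avatarMesh));
  analyser.getByteFrequencyData(freqData);

  // Average spectral energy -> a 0..1 "openness" value
  let sum = 0;
  for (let i = 0; i < freqData.length; i++) sum += freqData[i];
  const openness = Math.min(1, sum / freqData.length / 128);

  // Apply to the avatar's morph target, easing toward the target to avoid jitter
  const idx = avatarMesh.morphTargetDictionary?.["jawOpen"];
  if (idx !== undefined) {
    const current = avatarMesh.morphTargetInfluences[idx];
    avatarMesh.morphTargetInfluences[idx] = current + (openness - current) * 0.3;
  }
}
```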
Has anyone successfully done this? It seems complex, and I’m worried about the performance and accuracy.
My Questions
- What is the “Google” way to do this? Is there an official Google Cloud service or partnered SDK (like Soul Machines) that is built for this exact real-time use case?
- If I have to build it myself (Option 2), am I on the right track? Are there better libraries or tutorials for connecting a real-time audio stream to a `Three.js` avatar’s face?
- What other Google Cloud features could I use? Since I’m already on the platform, what other “wow” features could I add to my chatbot? I’ve seen Imagen (image generation) and Veo (video generation), but are there other cool APIs that work well with a chatbot?
I’m open to any suggestions, from official paid SDKs to open-source libraries.
Thanks in advance!