The new Gemini Multimodal Live API is great for voice-to-voice conversational AI and has video input, too. It’s really cool.
Google’s docs are here:
There are also open source client SDKs for the Web, React, Android, iOS, and C++ that are part of the Pipecat ecosystem. These SDKs have device management, echo cancellation, and noise reduction built in, plus lots of other features, including hooks for function calling and tool use. They support both WebSocket and WebRTC network transports.
Here’s a getting started guide for using WebRTC and these clients with Gemini 2.0: https://docs.pipecat.ai/guides/features/gemini-multimodal-live
And here’s a full-featured starter kit — a chat application with:
- a voice-to-voice WebSocket mode,
- an HTTP mode for text and image input, and
- a WebRTC mode with text, voice, camera video, and screenshare video.
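
For a rough sense of what the voice-to-voice WebSocket mode puts on the wire, here's a minimal Python sketch that only builds the JSON messages. It opens no connection. The model name and the field names (`setup`, `generation_config`, `response_modalities`, `realtime_input`, `media_chunks`) are assumptions based on my reading of Google's Multimodal Live API docs, and the helper function names are made up for illustration, so check everything against the current docs before relying on it.

```python
import base64
import json

# Assumption: model name as used in the Gemini 2.0 experimental release.
MODEL = "models/gemini-2.0-flash-exp"


def build_setup_message(model: str = MODEL, modality: str = "AUDIO") -> str:
    """Build the first message of a session: pick the model and response modality.

    Hypothetical helper; the field names are assumed from Google's docs.
    """
    return json.dumps({
        "setup": {
            "model": model,
            "generation_config": {"response_modalities": [modality]},
        }
    })


def build_audio_chunk_message(pcm_bytes: bytes, sample_rate: int = 16000) -> str:
    """Wrap raw PCM audio bytes as a base64-encoded realtime input chunk.

    Hypothetical helper; the mime type format is assumed from Google's docs.
    """
    return json.dumps({
        "realtime_input": {
            "media_chunks": [{
                "mime_type": f"audio/pcm;rate={sample_rate}",
                "data": base64.b64encode(pcm_bytes).decode("ascii"),
            }]
        }
    })
```

In a real app you would send the setup message once after the socket opens, then stream audio chunk messages as the microphone produces them. The client SDKs mentioned above are meant to take care of this kind of plumbing, along with device handling.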