I’ve been trying out NotebookLM’s audio overview feature for a while and listening to actual podcasts/interviews with Steve Johnston (VP) and Raiza Martin (Senior PM) of NotebookLM at Google. They keep saying they use Gemini for the AI, but I’ve never heard them name the model used to create the audio voices. Does anyone have any idea?
I can’t figure out how the audio manages to feel like a real conversation. Sure, Gemini could be used to generate the script, but the conversation feels genuinely spontaneous: one speaker jumps in right after the other finishes. It doesn’t sound like two separate voices produced by TTS APIs; it’s far too natural for that.
I’ve tried various TTS offerings (OpenAI’s, AWS Polly, Google Cloud TTS), but nothing replicates this conversation-like behavior. Is the model publicly available? Is it an open-source project?
Gemini’s APIs don’t have audio output, so what is it? Very curious!
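To be concrete about what I mean by “two different voices using TTS APIs”: here’s roughly the pipeline I tried, sketched with the Google Cloud TTS Python client and two of the Journey voices. The script lines are placeholders (in practice you’d generate them with Gemini or another LLM), and the voice names are just the ones I happened to test with:

```python
# Naive two-voice "podcast" pipeline: an LLM writes the script,
# then each turn is synthesized separately and the clips are
# concatenated. This is the approach that fails to sound like
# NotebookLM -- the speakers never react to each other.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Placeholder script; in practice generated by an LLM.
script = [
    ("A", "So today we're digging into the paper you sent over."),
    ("B", "Right, and honestly the results surprised me."),
    ("A", "Same here. Let's start with the setup."),
]

# Two of Google's Journey voices, one per host.
voices = {
    "A": texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Journey-D"
    ),
    "B": texttospeech.VoiceSelectionParams(
        language_code="en-US", name="en-US-Journey-F"
    ),
}

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3
)

clips = []
for speaker, line in script:
    response = client.synthesize_speech(
        input=texttospeech.SynthesisInput(text=line),
        voice=voices[speaker],
        audio_config=audio_config,
    )
    clips.append(response.audio_content)

# Crude concatenation of MP3 frames; good enough to hear the problem.
with open("fake_podcast.mp3", "wb") as f:
    f.write(b"".join(clips))
```

Each turn is synthesized in isolation, so you get clean alternation between two good voices, but none of the interruptions, backchannels (“right,” “mm-hmm”), or shared pacing that make the NotebookLM hosts sound like they’re actually listening to each other.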
I listened to the deep dive podcast where the two hosts are the two NotebookLM voices. The way they talk, intonate, and emphasize is so natural that I wouldn’t have realized it was AI if I hadn’t known. The way they go back and forth, ask each other questions, and so on. It feels a level above the TTS I use for my multimodal, voice-enabled agent submission. My agent’s responses are fine, but not as vibrant and alive; they don’t feel as engaging. I wouldn’t call them robotic, because the TTS is quite good, but next to NotebookLM they sound as robotic as a decade-old TTS sounds next to mine.
Again, the magic of NotebookLM isn’t just that the voices sound like really high-quality TTS (like this Journey model). I suspect there’s some additional AI model that weaves two high-quality voices together in a way that feels natural and conversational. It’s hard to explain unless you’ve tried it yourself: https://notebooklm.google.com/
They already have millions of users for a reason… and no, it’s not because Google is marketing it; they barely are.