Different Speech to Text models offered by Google

I’m a bit puzzled by the different Speech to Text models offered by Google. Can you help me understand the different offers?

There is the old one on https://cloud.google.com/speech-to-text
This is not an LLM but a purpose-built “old” type architecture, right?

Then from the ElevenLabs announcement I see that one of the best, state-of-the-art Speech to Text models is Gemini 2.0 Flash?

But how is it doing Speech to Text? You send a prompt “please transcribe the following audio”, or similar?

Also, when I click the green “Try Gemini 2.0 Flash, our newest model with low latency and enhanced performance” link at the top of cloud.google.com/speech-to-text, it takes me to console.cloud.google.com/vertex-ai/studio/freeform, where, if I click Speech / Speech-to-Text,

I end up on a page with:

“About this model “Chirp 2” is a multilingual automatic speech recognition (ASR) model developed by Google that transcribes speech (Speech-to-Text).”

I mean, I’m following Gemini progress quite closely, and still, the only way to figure out that Gemini Flash 2.0 is one of the best / most affordable Speech to Text models is from a competitor’s announcement?

Please help me understand these models.

Hi @hyperknot, Welcome to the forum!!!

Let me give it a try.

Earlier models like “Chirp 2” are speech-specific: they only convert speech to text. So if you want an LLM to respond to speech, you first have to call a speech-to-text model and then send the resulting text to a text-processing LLM to get the response.
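
To make that two-step flow concrete, here is a minimal sketch in Python. The function names and the stub implementations are hypothetical placeholders, not real Google APIs; in practice the first step would call a dedicated ASR model such as Chirp 2 and the second a text LLM.

```python
from typing import Callable


def answer_from_speech(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
) -> str:
    """Two-step pipeline: speech-to-text first, then a text-only LLM."""
    transcript = transcribe(audio)  # step 1: dedicated ASR model
    return generate(transcript)     # step 2: text LLM responds to the transcript


# Hypothetical stand-ins so the sketch runs without any cloud calls.
def fake_asr(audio: bytes) -> str:
    return "what is the capital of France"


def fake_llm(prompt: str) -> str:
    return f"(answer to: {prompt})"


print(answer_from_speech(b"raw audio bytes", fake_asr, fake_llm))
```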

Now, Gemini has multimodal capabilities.

With the Gemini 1.5 models, you can provide multimodal input (text, image, audio, video) and get the response as text only.

The Gemini 2.0 series is even more powerful: you can provide multimodal input (text, image, audio, video) and get a multimodal response as well. However, speech and image output are still in private/public preview, so for now the output is text only.

As for the green “Try Gemini 2.0 Flash, our newest model with low latency and enhanced performance” link: it takes you to Vertex AI Studio, where you can try the new Gemini 2.0 Flash model. There you can either record your voice or upload an audio file, add it to the prompt, and just write “Transcribe”; it will convert the audio to text verbatim.

Similarly, there is Google AI Studio, where you can try the latest Gemini/Gemma models.

You are correct: a prompt such as “please transcribe the following audio”, or just “Transcribe”, will convert the speech to text.
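
If you want to do the same thing from code rather than in the Studio UI, a rough sketch against the Gemini API's REST `generateContent` method could look like the one below. The endpoint URL, the model ID `gemini-2.0-flash`, the `audio/mp3` MIME type, and the helper names `build_transcribe_body` and `transcribe` are my assumptions for illustration, so please check them against the current documentation before relying on them.

```python
import base64
import json
import urllib.request

# Assumed endpoint and model ID for the Gemini API; verify against the docs.
ENDPOINT = (
    "https://generativelanguage.googleapis.com/v1beta/"
    "models/gemini-2.0-flash:generateContent"
)


def build_transcribe_body(
    audio: bytes, mime_type: str = "audio/mp3", prompt: str = "Transcribe"
) -> dict:
    """Build a request body holding a text prompt plus base64-encoded inline audio."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(audio).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


def transcribe(audio: bytes, api_key: str, mime_type: str = "audio/mp3") -> str:
    """POST the request and return the first candidate's text (needs network access)."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_transcribe_body(audio, mime_type)).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload["candidates"][0]["content"]["parts"][0]["text"]
```

You would call it with the bytes of your own recording and your own API key, for example `transcribe(audio_bytes, api_key)`. Sending the audio inline like this is only meant for short clips; longer recordings would probably need to be uploaded as a file first.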

I hope I clarified your doubt to some extent.

Thanks
