Different Speech to Text models offered by Google

hyperknot · February 28, 2025, 2:48pm

I’m a bit puzzled by the different Speech to Text models offered by Google. Can you help me understand the different offers?

There is the old one on https://cloud.google.com/speech-to-text
This is not an LLM but a purpose-built “old” type architecture, right?

Then from the ElevenLabs announcement I see that one of the best, state-of-the-art Speech to Text models is Gemini 2.0 Flash?

But how is it doing Speech to Text? You send a prompt “please transcribe the following audio”, or similar?

Also, when I click on the green: “Try Gemini 2.0 Flash, our newest model with low latency and enhanced performance” link at the top of cloud.google.com/speech-to-text, it brings me to this page: console.cloud.google.com/vertex-ai/studio/freeform where if I click Speech / Speech-to-Text

I end up on a page with:

“About this model “Chirp 2” is a multilingual automatic speech recognition (ASR) model developed by Google that transcribes speech (Speech-to-Text).”

I mean, I’m following Gemini progress quite closely, and still, the only way to figure out that Gemini 2.0 Flash is one of the best / most affordable Speech to Text models is from a competitor’s announcement?

Please help me understand these models.

Govind_Keshari · March 4, 2025, 9:36am

Hi @hyperknot, welcome to the forum!

Let me give it a try.

The earlier models like “Chirp 2” are specific to speech: they only convert speech to text. So if you want an LLM to respond to speech, you first need to call the speech-to-text model and then send the resulting text to a text-processing LLM to get the response.

Now, Gemini has multimodal capabilities.

With the Gemini 1.5 models, you can provide multimodal input (text, image, audio, video) and get a text-only response.

The Gemini 2.0 series is even more powerful: you can provide multimodal input (text, image, audio, video) and get a multimodal response as well. However, speech and image output are still in private/public preview, so for now the output is text only.

As for the green “Try Gemini 2.0 Flash, our newest model with low latency and enhanced performance” link: it takes you to Vertex AI Studio, where you can try the new Gemini 2.0 Flash model. You can either record your voice or upload an audio file, add it to the prompt, and just write “Transcribe”; it will convert the audio to text as is.

Similarly, there is Google AI Studio, where you can try the latest Gemini/Gemma models.

You are correct: a prompt such as “please transcribe the following audio”, or just “Transcribe”, will convert speech to text.
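If you would rather do this from code than from the Studio UI, here is a minimal sketch of the same idea against the Gemini API's `generateContent` REST method: one text part holding the prompt and one inline audio part. The model name `gemini-2.0-flash`, the `audio/mp3` MIME type, and the helper names `build_transcription_request` and `transcribe` are my own assumptions for illustration, so please check them against the current API docs before relying on this.

```python
import base64
import json
import urllib.request

# generateContent endpoint of the Gemini API; the model name is filled in below.
API_URL = "https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent"


def build_transcription_request(audio_bytes, mime_type="audio/mp3", prompt="Transcribe"):
    """Build the JSON body: a text part with the prompt plus an inline audio part."""
    return {
        "contents": [
            {
                "parts": [
                    {"text": prompt},
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            # Inline audio must be base64-encoded text.
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


def transcribe(audio_path, api_key, model="gemini-2.0-flash", mime_type="audio/mp3"):
    """Send the audio file with a 'Transcribe' prompt and return the model's text reply."""
    with open(audio_path, "rb") as f:
        body = build_transcription_request(f.read(), mime_type)
    request = urllib.request.Request(
        API_URL.format(model=model),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    # Assumes the first candidate's first part holds the transcript text.
    return result["candidates"][0]["content"]["parts"][0]["text"]
```

Calling `transcribe("sample.mp3", api_key)` with your own key and a short clip (the file name is hypothetical) should return the transcript as plain text. Inline audio like this is only suitable for small files; longer recordings would need to be uploaded separately first, which this sketch does not cover.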

I hope I clarified your doubt to some extent.

Thanks
