I’m a bit puzzled by the different Speech-to-Text models offered by Google. Can you help me understand the different offerings?
First, there is the long-standing one at https://cloud.google.com/speech-to-text. That one is not an LLM but a purpose-built, older-style ASR architecture, right?
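For reference, this is roughly how I understand a call to that classic API works today, just to be concrete about what I mean. This is only a sketch of the v1 REST request body (`POST https://speech.googleapis.com/v1/speech:recognize`); the audio bytes are a placeholder and a real call needs credentials:

```python
import base64
import json

# Placeholder for real LINEAR16 WAV content; a real call would read a file.
audio_bytes = b"\x00\x01"

# Request body for the classic Speech-to-Text v1 recognize endpoint:
# a recognition config plus base64-encoded inline audio.
request_body = {
    "config": {
        "encoding": "LINEAR16",
        "sampleRateHertz": 16000,
        "languageCode": "en-US",
    },
    "audio": {"content": base64.b64encode(audio_bytes).decode("ascii")},
}

print(json.dumps(request_body, indent=2))
```

Note there is no prompt anywhere: you configure the encoding and language, send audio, and get a transcript back.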
Then, from an ElevenLabs announcement, I see that Gemini 2.0 Flash is apparently one of the best, state-of-the-art Speech-to-Text models?
But how does it do Speech-to-Text? Do you just send a prompt like “please transcribe the following audio” together with the audio?
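Is it something like this? My guess at the request shape, based on how the `generateContent` endpoint handles other inline media — the prompt wording, MIME type, and placeholder audio bytes are all my assumptions:

```python
import base64
import json

# Placeholder for real MP3 content; a real call needs an API key and would
# POST this to the generateContent endpoint, e.g.
# https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent
audio_bytes = b"\x00\x01"

# A text part carrying the instruction, plus an inline audio part.
request_body = {
    "contents": [{
        "parts": [
            {"text": "Please transcribe the following audio."},
            {"inline_data": {
                "mime_type": "audio/mp3",
                "data": base64.b64encode(audio_bytes).decode("ascii"),
            }},
        ]
    }]
}

print(json.dumps(request_body, indent=2))
```

If that is right, transcription is just prompting a multimodal LLM, which is a very different interface from the classic API’s encoding/language config.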
Also, when I click the green “Try Gemini 2.0 Flash, our newest model with low latency and enhanced performance” link at the top of cloud.google.com/speech-to-text, it takes me to console.cloud.google.com/vertex-ai/studio/freeform, where clicking Speech / Speech-to-Text lands me on a page that says:
“About this model: ‘Chirp 2’ is a multilingual automatic speech recognition (ASR) model developed by Google that transcribes speech (Speech-to-Text).”
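So now there is a third model name, Chirp 2. If I had to call it, I’d guess it goes through the Speech-to-Text v2 recognizers API with `"model": "chirp_2"` — but the endpoint shape and field names here are my guess, and the recognizer path segments are placeholders:

```python
import base64
import json

# Placeholder audio; a real call would POST this to something like
# https://speech.googleapis.com/v2/projects/PROJECT/locations/LOCATION/recognizers/RECOGNIZER:recognize
# (project, location, and recognizer IDs are placeholders I made up).
audio_bytes = b"\x00\x01"

# My guess at a v2 recognize request selecting the Chirp 2 model.
request_body = {
    "config": {
        "autoDecodingConfig": {},
        "model": "chirp_2",
        "languageCodes": ["en-US"],
    },
    "content": base64.b64encode(audio_bytes).decode("ascii"),
}

print(json.dumps(request_body, indent=2))
```

So Chirp 2 looks like a dedicated ASR model again, not an LLM — which makes it even less clear where Gemini 2.0 Flash fits in.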
I mean, I follow Gemini’s progress quite closely, and still the only way I found out that Gemini 2.0 Flash is one of the best / most affordable Speech-to-Text models was a competitor’s announcement?
Please help me understand these models.