How can I convert text to speech and vice versa, and do speech-to-speech, using Gemini, please?
Hey there and welcome to the community!
So, even though these are multi-modal models, for now they can only generate text, which means what you are asking for is not possible at the moment.
Currently, Gemini cannot perform transcription or speech-to-speech tasks; it is an LLM designed for generating text. The 1.5 preview via Vertex AI can, however, take audio input for transcription, but it cannot perform Text-to-Speech (TTS). For TTS and transcription needs, you can explore dedicated Speech-to-Text (ASR) and Text-to-Speech (TTS) services, such as:
- For Speech-to-Text (ASR): Google Cloud Speech-to-Text
- For Speech-to-Text (ASR): OpenAI Whisper
- For Text-to-Speech (TTS): Google Cloud Text-to-Speech
- For Text-to-Speech (TTS): ElevenLabs
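Since Gemini itself won’t do the TTS side, here is a minimal sketch of the Google Cloud Text-to-Speech route; it assumes the google-cloud-texttospeech package is installed and Application Default Credentials are configured, and the voice settings and output file name are just placeholders:
# Minimal TTS sketch (assumptions: google-cloud-texttospeech installed, ADC configured)
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(text="Hello from the Gemini forum!")
voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    ssml_gender=texttospeech.SsmlVoiceGender.NEUTRAL,
)
audio_config = texttospeech.AudioConfig(audio_encoding=texttospeech.AudioEncoding.MP3)

response = client.synthesize_speech(
    input=synthesis_input, voice=voice, audio_config=audio_config
)

# response.audio_content holds the raw MP3 bytes
with open("output.mp3", "wb") as out:
    out.write(response.audio_content)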
Welcome to the Google AI dev forum @Ursulla_Zeking
Here are the docs on how you can prompt with media files, including audio.
There’s also this audio quickstart notebook from the Gemini cookbook.
AFAIK the capability is only for text output. Gemini atm doesn’t offer TTS.
Gemini is a text-generative AI model; it cannot convert text to speech, but you can try Gemini 1.5 to convert speech to text. You can create a program that records your voice, passes the recording to Gemini 1.5 as an audio file, and asks it to transcribe in the prompt.
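A minimal sketch of that flow with the google-generativeai SDK could look like this (the file name, MIME type, and environment variable are placeholders, and the recording step itself is left out):
# Sketch: transcribe a recorded audio file with Gemini 1.5 (assumes GOOGLE_API_KEY is set)
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Upload the recording via the File API, then ask the model to transcribe it
audio_file = genai.upload_file(path="recording.ogg", mime_type="audio/ogg")
model = genai.GenerativeModel("models/gemini-1.5-pro-latest")
response = model.generate_content(["Transcribe this audio.", audio_file])
print(response.text)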
Even the STT capability claimed for Gemini 1.5 Pro fails for audio that is 25-30 minutes long. Any idea about that?
IMO the speech prompting that can be used as STT isn’t meant for transcription; it’s meant to let you prompt the model directly with voice, instead of having to transcribe first and then make a chat-completion API call to the model.
If the goal is simply to transcribe, I’d recommend using Google STT instead.
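For long recordings like the one discussed below, a rough sketch with Google Cloud Speech-to-Text would use the asynchronous API and a Cloud Storage URI; the bucket path, encoding, sample rate, and language code here are assumptions for illustration:
# Sketch: long-audio transcription with Google Cloud Speech-to-Text
# (assumes google-cloud-speech is installed and the file is already in a GCS bucket)
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.OGG_OPUS,
    sample_rate_hertz=48000,
    language_code="en-US",
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/157726.ogg")

# long_running_recognize handles audio longer than ~1 minute
operation = client.long_running_recognize(config=config, audio=audio)
result = operation.result(timeout=1800)
for r in result.results:
    print(r.alternatives[0].transcript)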
But I could see this in the Google docs. This is when we access the Gemini preview version through Vertex AI.
Interesting. Can you describe how the transcriptions are failing?
I’ve been trying to transcribe an audio file of 25 minutes that is not in English. Here are the methods I tried:
- I tried using Google AI Studio, which worked well and gave me a structured output.
- I attempted the same task through the generative AI Python SDK (without exceeding the token limit) and received a 504 error stating “deadline exceeded.” I tried this with an API key associated with a billing-enabled project, and also the other way around.
- I tried building it through Vertex AI, but the generation seems stuck. I attempted both streaming and non-streaming responses, and neither worked.
Please note that all these methods work for shorter audio files.
Interestingly, the 25-minute non-English audio was working through Vertex AI until last Friday. Any idea about this weird behavior?
What I would do: split the audio into two sub-15-minute files, transcribe each half separately, and rejoin the text files. That will get you past the processing-time deadline that the 504 represents.
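If it helps, here is one way to do that split; this sketch assumes pydub (with ffmpeg available), which is just my tooling choice, not something required by the API:
# Sketch: split an OGG file into two halves with pydub, then transcribe each half
from pydub import AudioSegment

audio = AudioSegment.from_file("157726.ogg", format="ogg")
midpoint = len(audio) // 2  # pydub lengths and slices are in milliseconds

audio[:midpoint].export("157726_part1.ogg", format="ogg")
audio[midpoint:].export("157726_part2.ogg", format="ogg")
# Transcribe each part separately, then concatenate the two transcripts.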
Thanks for sharing this info @soumya_sebastian
A 504 indicates a gateway timeout.
I’d recommend testing with higher timeouts using request options, e.g.:
# timeout is specified in seconds; 600 allows up to 10 minutes for the call
response = model.generate_content(request, request_options={"timeout": 600})
"
At the command line, only need to run once to install the package via pip:
$ pip install google-generativeai
“”"
import os
from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
api_key = os.getenv("GOOGLE_API_KEY")  # assuming the key is stored under this name in .env
genai.configure(api_key=api_key)
# Set up the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
}
safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]
file_path = "…/audio_ogg/157726.ogg"
display_name = "Sample audio"

# Upload the audio via the File API
file_response = genai.upload_file(path=file_path, mime_type="audio/ogg", display_name=display_name)
print("file_response", file_response)
print(f"Uploaded file {file_response.display_name} as: {file_response.uri}")

prompt = "Transcribe the audio in English."
# Verify the file is uploaded to the API
get_file = genai.get_file(name=file_response.name)
print(f"Retrieved file {get_file.display_name} as: {get_file.uri}")

model = genai.GenerativeModel(
    model_name="models/gemini-1.5-pro-latest",
    generation_config=generation_config,
    safety_settings=safety_settings,
)
response = model.generate_content([prompt, file_response], request_options={"timeout": 18000})
print(response)
print(response.text)
So this is the sample code I am using. I tried with higher timeouts.
I am still getting an error like this:
raise ValueError(
ValueError: The response.text quick accessor only works when the response contains a valid Part, but none was returned. Check the candidate.safety_ratings to see if the response was blocked.
Would you check?
PS: The audio doesn’t contain any abusive, hateful, or dangerous content. I also removed the safety settings and tried, but I’m still facing the same error.
What are the results from the print(response) line?
As the error you showed suggests, no text was returned. You need to take a look at response.candidates[0].finish_reason and response.prompt_feedback to see what the possible cause is.
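For example, something along these lines (a sketch, assuming response is the object returned by generate_content in your script):
# Inspect why no text came back
print(response.prompt_feedback)  # shows whether the prompt itself was blocked
for candidate in response.candidates:
    print(candidate.finish_reason)   # e.g. STOP, MAX_TOKENS, SAFETY, RECITATION
    print(candidate.safety_ratings)  # per-category safety ratings for this candidate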
What language is the audio in? Can you share the audio? I’d like to see if I can reproduce this at my end.