Transcribe text to speech and vice versa, speech to speech, and image to text in a Flutter app using Gemini

How can I transcribe text to speech and vice versa, and speech to speech, using Gemini, please?

Hey there and welcome to the community!

So, even though these are multimodal models, for now they can only generate text. So text to speech and speech to speech are not possible at the moment.

Currently, Gemini cannot perform transcription or speech-to-speech tasks. It is an LLM designed for generating text. However, the 1.5 preview via Vertex AI can take audio input for transcription, but not for Text-to-Speech (TTS) tasks. For TTS and transcription needs, you can explore various Speech-to-Text (ASR) and Text-to-Speech (TTS) services, such as:


For Speech-to-Text (ASR): Google Cloud Speech-to-Text

For Speech-to-Text (ASR): OpenAI Whisper

For Text-to-Speech (TTS): Google Cloud Text-to-Speech

For Text-to-Speech (TTS): ElevenLabs


Welcome to the Google AI dev forum @Ursulla_Zeking

Here are the docs on how you can prompt with media files, including audio.

Additionally, there’s also this audio quickstart notebook from the Gemini cookbook.

AFAIK the capability is only for text output. Gemini atm doesn’t offer TTS.

Gemini is a text-generating AI model; it cannot convert text to speech, but you can try Gemini 1.5 to convert speech to text. You can create a program that records your voice, passes the recording to Gemini 1.5 as an audio file, and asks in the prompt for a transcription.
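A minimal sketch of that flow in Python, assuming the `google-generativeai` package is installed and the recording already exists on disk. The helper names, the extension table, and the `GOOGLE_API_KEY` environment variable are assumptions made for illustration; the SDK calls are the same ones used elsewhere in this thread.

```python
import os

# Hypothetical lookup table for this sketch; extend it for other formats.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def guess_audio_mime(path):
    """Map a file extension to an audio MIME type, or raise ValueError."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"unsupported audio extension: {ext!r}")
    return AUDIO_MIME_TYPES[ext]


def transcribe_audio(path, prompt="Transcribe this audio."):
    """Upload a recorded audio file and ask the model for a transcript."""
    # Imported lazily so guess_audio_mime works without the SDK installed.
    import google.generativeai as genai

    # Assumption: the API key is stored in the GOOGLE_API_KEY variable.
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    audio_file = genai.upload_file(path=path, mime_type=guess_audio_mime(path))
    model = genai.GenerativeModel(model_name="models/gemini-1.5-pro-latest")
    response = model.generate_content([prompt, audio_file])
    return response.text
```

Calling `transcribe_audio("recording.ogg")` would then return the model's text output; recording the audio itself is left to whatever capture library the app already uses.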

Even for the STT capability claimed for Gemini 1.5 Pro, it fails for audio files of 25-30 minutes. Any idea about it?

Hi @soumya_sebastian

IMO the speech prompting, which can be used as STT, isn’t meant for transcription. It’s meant to let you prompt the model directly with voice, instead of having to transcribe first and then make a chat completion API call to the model.

If the goal is simply to transcribe, I’d recommend using Google STT instead.

But I could see this in the Google docs. This is when we access the Gemini preview version through Vertex AI.

Interesting. Can you describe how the transcriptions are failing?

@sps

I’ve been trying to transcribe an audio file of 25 minutes that is not in English. Here are the methods I tried:

  1. I tried using Google AI Studio, which worked well and gave me a structured output.

  2. I attempted the same task through the generative AI Python SDK (without exceeding the token limit) and received a 504 error stating “deadline exceeded.” I tried this with an API key associated with a billing-enabled project and also with one that was not.

  3. I tried building it through Vertex AI, but the generation seems stuck. I attempted both streaming and non-streaming responses, but neither worked.

Please note, all these methods work for shorter audio files.

Interestingly, the 25-minute non-English audio was working through Vertex AI until last Friday. Any idea about this weird behavior?

What I would do: split the audio into two sub-15-minute files, transcribe each half separately, and rejoin the text files. That will get you past the processing-time deadline that the 504 represents.
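The split-and-rejoin idea can be sketched with only the Python standard library, under the assumption that the audio has first been converted to WAV (the `wave` module does not read OGG). The function names, the 14-minute default, and the `transcribe` callable are hypothetical; `transcribe` stands in for whatever function sends one file to the model and returns its text.

```python
import wave


def split_wav(src_path, chunk_seconds, out_prefix):
    """Split a WAV file into consecutive chunks of at most chunk_seconds."""
    out_paths = []
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(chunk_seconds * src.getframerate())
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                # Copy the format; the frame count is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths


def transcribe_in_chunks(src_path, transcribe, chunk_seconds=14 * 60,
                         out_prefix="chunk"):
    """Transcribe each chunk separately and join the text in order."""
    chunk_paths = split_wav(src_path, chunk_seconds, out_prefix)
    return "\n".join(transcribe(path) for path in chunk_paths)
```

Cutting at fixed times can split a word in half at the boundary, so a real pipeline might prefer to cut at a pause or overlap the chunks slightly.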

Thanks for sharing this info @soumya_sebastian

504 indicates gateway timeout.

I’d recommend testing with higher timeouts using request options:
e.g.

response = model.generate_content(request,
                                  request_options={"timeout": 600})
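If it is unclear how long a timeout is needed, one option is to retry with progressively larger values. The helper below is a generic sketch, not part of the SDK: `call` is a hypothetical callable that takes a timeout in seconds, and `retry_on` lists the exception types treated as a timeout.

```python
def generate_with_escalating_timeout(call, timeouts=(120, 600, 1800),
                                     retry_on=(TimeoutError,)):
    """Call call(timeout) with each timeout in turn until one succeeds."""
    if not timeouts:
        raise ValueError("at least one timeout is required")
    last_error = None
    for timeout in timeouts:
        try:
            return call(timeout)
        except retry_on as error:
            # Remember the failure and try again with the next timeout.
            last_error = error
    raise last_error
```

With the SDK, `call` could be `lambda t: model.generate_content(request, request_options={"timeout": t})`. I believe the 504 surfaces as `google.api_core.exceptions.DeadlineExceeded`, so that class would likely need to be passed in `retry_on`, but check the traceback you actually get.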

@sps

"""
At the command line, only need to run once to install the package via pip:

$ pip install google-generativeai
"""
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
# Assumption: the API key is stored in .env under the name GOOGLE_API_KEY.
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)

# Set up the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
}

safety_settings = [
    {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "BLOCK_NONE",
    },
    {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "BLOCK_NONE",
    },
    {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "BLOCK_NONE",
    },
    {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "BLOCK_NONE",
    },
]

# Upload the audio file
file_path = "…/audio_ogg/157726.ogg"
display_name = "Sample audio"
file_response = genai.upload_file(
    path=file_path, mime_type="audio/ogg", display_name=display_name
)
print("file_response", file_response)
print(f"Uploaded file {file_response.display_name} as: {file_response.uri}")
prompt = "Transcribe the audio in english."

# Verify the file is uploaded to the API
get_file = genai.get_file(name=file_response.name)
print(f"Retrieved file {get_file.display_name} as: {get_file.uri}")

# Send the prompt and the uploaded audio to the model
model = genai.GenerativeModel(
    model_name="models/gemini-1.5-pro-latest",
    generation_config=generation_config,
    safety_settings=safety_settings,
)
response = model.generate_content(
    [prompt, file_response], request_options={"timeout": 18000}
)
print(response)
print(response.text)

So this is the sample code I am using. I tried it with higher timeouts.

I am still getting an error like this:

raise ValueError(

ValueError: The response.text quick accessor only works when the response contains a valid Part, but none was returned. Check the candidate.safety_ratings to see if the response was blocked.

Would you check?

PS: The audio doesn’t contain any abusive, hateful, or dangerous content. I also removed the safety settings and tried again, but I am still facing the same error.

What are the results from the print(response) line?

As the error you showed suggests, no text is returned. You need to take a look at response.candidates[0].finish_reason and response.prompt_feedback to see what the possible cause is.
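A small helper along these lines avoids the `ValueError` from `response.text` and reports why nothing came back. It is a sketch that relies only on the attributes named in this thread (`prompt_feedback`, `candidates`, `finish_reason`) plus `content.parts`; the function name and the diagnostic strings are made up for illustration.

```python
def text_or_reason(response):
    """Return (text, None) on success or (None, reason) when no text exists."""
    feedback = getattr(response, "prompt_feedback", None)
    block_reason = getattr(feedback, "block_reason", None)
    if block_reason:
        return None, f"prompt blocked: {block_reason}"

    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None, "no candidates returned"

    candidate = candidates[0]
    content = getattr(candidate, "content", None)
    parts = getattr(content, "parts", None) or []
    if not parts:
        reason = getattr(candidate, "finish_reason", None)
        return None, f"candidate has no parts, finish_reason={reason}"

    # Join the text of every part of the first candidate.
    return "".join(part.text for part in parts), None
```

Usage would be `text, reason = text_or_reason(response)`, printing `reason` whenever `text` is `None`.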

What language is the audio in? Can you share the audio? I’d like to see if I can reproduce this at my end.