How can I convert text to speech and vice versa, and do speech-to-speech, using Gemini, please?
Hey there and welcome to the community!
So, even though these are multimodal models, for now they can only generate text. So:
is not possible at the moment.
Currently, Gemini cannot perform transcription or speech-to-speech tasks. It is an LLM designed for generating text. However, the 1.5 preview via Vertex AI can take audio input for transcription, but not for Text-to-Speech (TTS) tasks. For TTS and transcription needs, you can explore various Speech-to-Text (ASR) and Text-to-Speech (TTS) services, such as:
For Speech-to-Text (ASR): Google Cloud Speech-to-Text
For Speech-to-Text (ASR): OpenAI Whisper
For Text-to-Speech (TTS): Google Cloud Text-to-Speech
For Text-to-Speech (TTS): ElevenLabs
Welcome to the Google AI dev forum @Ursulla_Zeking
Here are the docs on how you can prompt with media files, including audio.
Additionally, there's also this audio quickstart notebook from the Gemini cookbook.
AFAIK the capability is only for text output. Gemini atm doesn't offer TTS.
Gemini is a text-generating AI model; it cannot convert text to speech, but you can try Gemini-1.5 to convert speech to text. You can create a program that records your voice and passes it to Gemini-1.5 as an audio file, and in the prompt you can ask it to transcribe.
Even the STT capability claimed for Gemini 1.5 Pro fails for audio files of 25-30 minutes. Any idea why?
IMO the speech prompting that can be used as STT isn't meant for transcription; instead it's meant for prompting the model directly with voice, rather than having to transcribe first and then make a chat completion API call to the model.
If the goal is simply to transcribe, I'd recommend using Google STT instead.
But I could see this in the Google docs. This is when we access the Gemini preview version through Vertex AI.
Interesting. Can you describe how the transcriptions are failing?
I've been trying to transcribe a 25-minute audio file that is not in English. Here are the methods I tried:
- I tried using Google AI Studio, which worked well and gave me a structured output.
- I attempted the same task through the generative AI Python SDK (without exceeding the token limit) and received a 504 error stating "deadline exceeded." I tried this with an API key associated with a billing-enabled project and also with one that has no billing enabled.
- I tried building it through Vertex AI, but the generation seems stuck. I attempted both streaming and non-streaming responses, but neither worked.
Please note, all these methods work for shorter audio files.
Interestingly, the 25-minute non-English audio was working until last Friday through Vertex AI. Any idea what causes this weird behavior?
What I would do: split the audio in two sub-15 min files, transcribe each half separately, and rejoin the text files. That will get you past the processing time deadline that the 504 represents.
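The splitting idea can be sketched roughly as follows. This is only an illustration: `chunk_bounds` and `join_transcripts` are hypothetical helper names, the 15-minute cap comes from the suggestion above, and the actual cutting of the audio (with whatever audio tool you prefer) is left out.

```python
import math


def chunk_bounds(duration_s, max_chunk_s=15 * 60):
    """Return (start, end) offsets in seconds that cover the whole recording
    in equal-sized chunks, each no longer than max_chunk_s."""
    if duration_s <= 0 or max_chunk_s <= 0:
        raise ValueError("duration_s and max_chunk_s must be positive")
    n_chunks = math.ceil(duration_s / max_chunk_s)
    size = duration_s / n_chunks
    return [(i * size, min((i + 1) * size, duration_s)) for i in range(n_chunks)]


def join_transcripts(parts):
    """Rejoin per-chunk transcripts in order, skipping empty pieces."""
    return " ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    # A 25-minute file ends up as two chunks under the 15-minute cap.
    for start, end in chunk_bounds(25 * 60):
        print(f"cut from {start:.0f}s to {end:.0f}s")
```

Each pair of offsets would then be used to cut one sub-file, every sub-file transcribed with its own request, and the resulting texts passed to `join_transcripts` in order. Cutting at a fixed offset can split a word in half, so picking a pause near each boundary may give cleaner results.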
Thanks for sharing this info @soumya_sebastian
504 indicates gateway timeout.
I'd recommend testing with higher timeouts using request options, e.g.:
response = model.generate_content(request,
                                  request_options={"timeout": 600})
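If it is unclear how large the timeout needs to be, one option is to try a few values in turn. The sketch below is an assumption-laden illustration, not part of the SDK: `generate_with_timeouts` is a hypothetical helper, the timeout values are arbitrary, and the real call is passed in as a function so the helper stays independent of any library.

```python
import time


def generate_with_timeouts(call, timeouts=(120, 300, 600),
                           retry_on=(Exception,), pause_s=0.0):
    """Call call(timeout) with each timeout in turn and return the first
    successful result; re-raise the last error if every attempt fails."""
    if not timeouts:
        raise ValueError("timeouts must not be empty")
    last_error = None
    for timeout in timeouts:
        try:
            return call(timeout)
        except retry_on as exc:
            last_error = exc
            time.sleep(pause_s)  # optional pause before the next attempt
    raise last_error


# Intended use with the snippet above (model and request defined elsewhere):
# response = generate_with_timeouts(
#     lambda t: model.generate_content(request, request_options={"timeout": t})
# )
```

Catching every `Exception` is deliberately broad for the sketch; in practice you would probably narrow `retry_on` to whatever deadline or timeout error your client actually raises.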
"""
At the command line, only need to run once to install the package via pip:
$ pip install google-generativeai
"""
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()
# The original post does not show where api_key comes from; reading it from an
# environment variable named GOOGLE_API_KEY is an assumption.
api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=api_key)

# Set up the model
generation_config = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 8192,
    "response_mime_type": "application/json",
}
safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
]

file_path = "../audio_ogg/157726.ogg"
display_name = "Sample audio"
# Upload the audio file; display_name is passed so the prints below show it
file_response = genai.upload_file(path=file_path, mime_type="audio/ogg", display_name=display_name)
print("file_response", file_response)
print(f"Uploaded file {file_response.display_name} as: {file_response.uri}")

prompt = "Transcribe the audio in english."

# Verify the file is uploaded to the API
get_file = genai.get_file(name=file_response.name)
print(f"Retrieved file {get_file.display_name} as: {get_file.uri}")

model = genai.GenerativeModel(
    model_name="models/gemini-1.5-pro-latest",
    generation_config=generation_config,
    safety_settings=safety_settings,
)
response = model.generate_content([prompt, file_response], request_options={"timeout": 18000})
print(response)
print(response.text)
So this is the sample code I am using. I tried with higher timeouts,
but I am still getting an error like this:
raise ValueError(
ValueError: The response.text quick accessor only works when the response contains a valid Part, but none was returned. Check the candidate.safety_ratings to see if the response was blocked.
Would you check?
PS: The audio doesn't contain any abusive, hateful, or dangerous content. I also removed the safety settings and tried again, but I am still facing the same error.
What are the results from the print(response) line?
As the error you showed suggests, no text is returned. You need to take a look at response.candidates[0].finish_reason and response.prompt_feedback to see what the possible cause is.
What language is the audio in? Can you share the audio? I'd like to see if I can reproduce this at my end.
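A small helper along these lines could gather those fields before touching `response.text`. It is a sketch, not SDK code: `summarize_response` is a hypothetical name, it only reads attributes by name, and the demo uses a stand-in object rather than a real API response.

```python
from types import SimpleNamespace


def summarize_response(response):
    """Collect the fields that usually explain why a response has no text.
    Attributes are read defensively, so missing ones come back as None."""
    candidates = getattr(response, "candidates", None) or []
    return {
        "prompt_feedback": getattr(response, "prompt_feedback", None),
        "finish_reasons": [getattr(c, "finish_reason", None) for c in candidates],
        "safety_ratings": [getattr(c, "safety_ratings", None) for c in candidates],
    }


if __name__ == "__main__":
    # Stand-in object for illustration only; the values are made up.
    fake_response = SimpleNamespace(
        prompt_feedback=None,
        candidates=[SimpleNamespace(finish_reason="OTHER", safety_ratings=[])],
    )
    print(summarize_response(fake_response))
```

Calling `summarize_response(response)` right after `generate_content` and printing the result avoids the `ValueError`, because `response.text` is never accessed when no valid part was returned.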
