Cost estimation for audio input and text output

Hi everyone! I’m looking to compare the cost of running an audio summarization application by feeding the audio directly into Gemini versus a cascaded approach of transcription followed by LLM summarization.

The input audio’s length can vary from 30 to 50 minutes. The output is a summary of the input audio, approximately 230 words long. Currently, I am feeding the audio to a speech-to-text model, then feeding the transcript to an LLM to generate a summary. If I were to use Gemini in production for this use case, how much would it cost per use?

Thanks!

Hey! The simplest way to test this is to put the audio into AI Studio and see how many tokens it comes out to. You can then use the Gemini API Pricing page (Google for Developers) to calculate the cost.

Thanks Logan! I did some further calculations, and I found that audio input comes out to ~1920 tokens per minute, and the average output for my use case would be ~250 tokens.

Based on the pricing page, the cost for input is $0.35 / 1 million tokens (for prompts up to 128K tokens), and 128K tokens is approximately 67 minutes of audio, which is more than what I need. And for the output, I would be under the $1.05 / 1 million tokens (for prompts up to 128K tokens) umbrella.
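For anyone who wants to redo the token math above, here is a minimal sketch. It assumes the ~1920 tokens per minute figure measured in this thread and treats "128K" as 128,000 tokens; the function names are made up for illustration and it does not call the Gemini API.

```python
# Rough audio-token arithmetic, using figures quoted in this thread.
TOKENS_PER_AUDIO_MINUTE = 1920  # measured value from the post above
TIER_LIMIT_TOKENS = 128_000     # assumption: "128K" read as 128,000 tokens


def audio_tokens(minutes: float) -> int:
    """Estimate input tokens for an audio clip of the given length."""
    return round(minutes * TOKENS_PER_AUDIO_MINUTE)


def max_minutes_in_tier() -> float:
    """Longest audio (in minutes) that stays within the 128K prompt tier."""
    return TIER_LIMIT_TOKENS / TOKENS_PER_AUDIO_MINUTE


for minutes in (30, 50):
    print(f"{minutes} min of audio is about {audio_tokens(minutes)} tokens")
print(f"Tier limit is about {max_minutes_in_tier():.1f} minutes of audio")
```

Under these assumptions, both the 30-minute and the 50-minute case stay inside the 128K prompt tier. Any text instructions sent along with the audio would add a few more input tokens on top.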

The question that I have now is: say that my input is a total of 60k tokens and my output is 250 tokens. Would I be charged the full $0.35 for the input and $1.05 for the output, or would the cost go down in proportion to my token use?

The cost is proportional to the number of tokens you use during the billing period, so you only pay for the tokens you actually consume. The rate is expressed as cost / million tokens to make it easier to read.
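To make the proportional billing concrete, here is a small sketch of the per-request arithmetic for the 60k-input, 250-output example. The rates are the ones quoted earlier in this thread and may no longer match the current pricing page; `request_cost_usd` is a hypothetical helper, not part of any Google SDK.

```python
# Per-request cost under proportional (per-token) billing.
# Rates are the ones quoted in this thread; check the pricing page for current values.
INPUT_USD_PER_MILLION = 0.35   # prompts up to 128K tokens
OUTPUT_USD_PER_MILLION = 1.05  # prompts up to 128K tokens


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, scaling each rate by the tokens actually used."""
    input_cost = input_tokens * INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


cost = request_cost_usd(input_tokens=60_000, output_tokens=250)
print(f"Estimated cost per request: ${cost:.4f}")
```

With these rates the input works out to about $0.021 and the output to about $0.0003, so a request of this size costs roughly two cents rather than $0.35 plus $1.05.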