Are audio output tokens equal to text output tokens?

William_Kelley · May 22, 2025, 12:36am

I am playing around with the Gemini Flash/Pro TTS models, but I don’t understand the pricing.

Pricing is based on 1 million tokens as usual, but are those standard text-based tokens or are there special audio/multi-media tokens? I want to compare them to ElevenLabs and Hume, both of which charge by character.

Some basic analysis and math with Gemini says that if audio output tokens are counted the same way as text tokens, then Gemini TTS is about two orders of magnitude cheaper than ElevenLabs or Hume. That can’t be right. Either something is missing, or Gemini and I did not calculate correctly.

Richard_Davey · May 22, 2025, 12:56am

There are no special ‘audio tokens’ - they’re all just tokens. The Gemini docs state:

“Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.”

If you want to know the exact amount you can use the File API to upload an audio clip and call ‘count tokens’ on it.
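
For a rough estimate without uploading anything, the 32 tokens per second figure from the docs quote can be turned into a quick calculation. This is only a sketch: it assumes the same rate applies to generated TTS audio, and the price passed in is a made-up placeholder rather than a real Gemini rate, so substitute the number from the current pricing page.

```python
# Rate taken from the Gemini docs quote: 32 tokens per second of audio.
# Assumption: the same rate applies to TTS output audio.
TOKENS_PER_SECOND = 32


def audio_tokens(seconds: float) -> int:
    """Estimate the token count for a clip of the given length in seconds."""
    return round(seconds * TOKENS_PER_SECOND)


def audio_cost_usd(seconds: float, price_per_million_tokens: float) -> float:
    """Estimate the cost in USD, given a price per 1 million tokens."""
    return audio_tokens(seconds) * price_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Hypothetical price, for illustration only.
    placeholder_price = 10.0
    print(audio_tokens(60))  # one minute of audio
    print(audio_cost_usd(3600, placeholder_price))  # one hour of audio
```

To compare against a per-character price, you would also need to measure how many seconds of audio a given number of characters produces, since that depends on the voice and speaking rate.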
