Are audio output tokens equal to text output tokens?

I am playing around with the Gemini Flash/Pro TTS models, but I don’t understand the pricing.

Pricing is per 1 million tokens as usual, but are those standard text tokens, or are there special audio/multimedia tokens? I want to compare them to ElevenLabs and Hume, both of which charge by character.

Some rough math, done with Gemini’s help, says that if audio output tokens are counted the same way as text tokens, then Gemini TTS is about two orders of magnitude cheaper than ElevenLabs or Hume. That can’t be right. Either something is missing, or Gemini and I miscalculated.

There are no special ‘audio tokens’ - they’re all just tokens. The Gemini docs state:

“Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.”

If you want the exact count for a given clip, you can upload it with the File API and call ‘count tokens’ on it.
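For a rough comparison with per-character pricing, the 32 tokens per second figure quoted above can be turned into a cost estimate. This is only a sketch: the function names are made up for illustration, and the price and character count are placeholders you would replace with the current rate from the pricing page and a measurement of your own text.

```python
TOKENS_PER_SECOND = 32  # figure quoted from the Gemini docs above


def audio_tokens(seconds: float) -> int:
    """Estimate how many tokens a clip of the given length represents."""
    return round(seconds * TOKENS_PER_SECOND)


def audio_cost(seconds: float, price_per_million_tokens: float) -> float:
    """Estimate the cost of a clip, given a price per 1 million tokens."""
    return audio_tokens(seconds) * price_per_million_tokens / 1_000_000


def cost_per_character(seconds: float, characters: int,
                       price_per_million_tokens: float) -> float:
    """Convert the token-based cost into a per-character cost.

    `characters` is the length of the text that produced the clip;
    measure it on your own input rather than assuming a speaking rate.
    """
    return audio_cost(seconds, price_per_million_tokens) / characters


if __name__ == "__main__":
    # Placeholder values, NOT real prices or measurements.
    placeholder_price = 10.0   # hypothetical price per 1M audio output tokens
    clip_seconds = 60.0        # one minute of generated speech
    clip_characters = 900      # hypothetical length of the input text

    print(audio_tokens(clip_seconds))
    print(audio_cost(clip_seconds, placeholder_price))
    print(cost_per_character(clip_seconds, clip_characters, placeholder_price))
```

The per-character figure is what you would put next to the ElevenLabs or Hume rate. It depends on how many characters of text end up in each second of speech, so it is worth measuring on a real sample.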
