Are audio output tokens equal to text output tokens?

There are no special ‘audio tokens’; they’re all just tokens. The Gemini docs state:

“Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.”
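As a rough illustration of that arithmetic, here is a minimal sketch that estimates the token count from a clip's duration. The function name and the choice to round up partial seconds are my own assumptions, not something the docs specify; only the 32 tokens per second figure comes from the quote above.

```python
import math

# Rate quoted from the Gemini docs: 32 tokens per second of audio.
TOKENS_PER_SECOND = 32


def estimate_audio_tokens(duration_seconds: float) -> int:
    """Estimate the token count for an audio clip of the given length.

    Rounding up is an assumption made for this sketch; the real
    tokenizer may treat partial seconds differently.
    """
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


print(estimate_audio_tokens(60))  # 1920, matching the one-minute example
```

This is only an estimate, so treat it as a sanity check rather than a billing figure.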

If you want to know the exact count, you can upload an audio clip with the File API and call ‘count tokens’ on it.
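Something along these lines should work with the `google-genai` Python SDK, though I haven't run it against your setup. The helper name `count_audio_tokens`, the file name, and the model name are placeholders I picked for illustration; swap in whatever you actually use.

```python
def count_audio_tokens(client, path, model="gemini-2.0-flash"):
    """Upload an audio file and return the token count the API reports.

    `client` is expected to be a google-genai client, created with:
        from google import genai
        client = genai.Client()  # reads the API key from the environment
    The default model name is a placeholder assumption.
    """
    # Upload the clip through the File API.
    uploaded = client.files.upload(file=path)
    # Ask the model endpoint to count tokens for the uploaded file.
    response = client.models.count_tokens(model=model, contents=[uploaded])
    return response.total_tokens


# Example call (needs network access and a valid API key):
# print(count_audio_tokens(client, "sample.mp3"))
```

Passing the client in as an argument keeps the helper easy to try with different keys or models.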
