Looking at pricing and quota for 2.5 Flash/Pro TTS, I’m trying to figure out the best way to use it.
As I understand it, TTS output is billed in audio tokens at 32 tokens per second of audio, so 1,920 per minute. Max output for both TTS models is 16,000 tokens, which works out to 500 seconds, or about 8.33 minutes. So if I estimate that my input text will come to under 8 minutes of audio (I should probably err on the conservative side), I can do it all in one turn. Anything longer than that, I should break into a few chunks.
Main question: is the 32 tokens per second figure for audio output correct? And is there anything I am missing in this simple workflow?
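
For reference, here is a rough sketch of the arithmetic and the chunking idea, in case it helps anyone check my reasoning. The 32 tokens per second and 16,000 token figures are just the ones from this post and are not confirmed. The 150 words per minute speaking rate and the 0.8 safety factor are my own guesses, and the function names are made up for illustration. It does not call any TTS API.

```python
import re

TOKENS_PER_SECOND = 32      # figure from this post; unconfirmed
MAX_OUTPUT_TOKENS = 16_000  # per-request output cap quoted in this post
WORDS_PER_MINUTE = 150      # assumed speaking rate (a guess, tune per voice)
SAFETY_FACTOR = 0.8         # assumed margin so estimates err conservatively


def max_audio_seconds():
    """Longest audio one request could return under the assumed figures."""
    return MAX_OUTPUT_TOKENS / TOKENS_PER_SECOND


def estimated_tokens(text):
    """Estimate output audio tokens from word count and speaking rate."""
    words = len(text.split())
    seconds = words / WORDS_PER_MINUTE * 60
    return seconds * TOKENS_PER_SECOND


def chunk_text(text, budget_tokens=MAX_OUTPUT_TOKENS * SAFETY_FACTOR):
    """Greedily pack whole sentences into chunks that fit the token budget.

    A single sentence that exceeds the budget on its own is still emitted
    as one (oversized) chunk; this sketch does not split inside sentences.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and estimated_tokens(candidate) > budget_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With these numbers, `max_audio_seconds()` gives 500 seconds, and the default budget of 12,800 tokens corresponds to 400 seconds of audio, or roughly 1,000 words per chunk at the assumed speaking rate. Each chunk would then go out as its own request.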