I am playing around with the Gemini Flash/Pro TTS models, but I don’t understand the pricing.
Pricing is quoted per 1 million tokens as usual, but are those standard text tokens, or are there special audio/multimedia tokens? I want to compare the cost with ElevenLabs and Hume, both of which charge by character.
Some basic analysis and math done with Gemini suggests that if audio output tokens are counted the same way as text tokens, then Gemini TTS is about two orders of magnitude cheaper than ElevenLabs or Hume. That can’t be right. Either something is missing, or Gemini and I did not calculate correctly.
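
For anyone who wants to check the arithmetic, here is a rough sketch of the comparison under the two possible readings of "token". Every number in it is a made-up placeholder, not a real Gemini, ElevenLabs, or Hume price, and the function names are hypothetical. The idea that audio output might be billed at a fixed number of tokens per second of generated speech is only an assumption to be checked against the official pricing and speech generation docs.

```python
# Sketch only: all constants below are placeholder assumptions, not real prices.

def cost_if_text_like(chars, price_per_million_tokens, chars_per_token):
    """Cost if output tokens are counted like ordinary text tokens."""
    tokens = chars / chars_per_token
    return tokens / 1_000_000 * price_per_million_tokens


def cost_if_audio_tokens(chars, price_per_million_tokens,
                         tokens_per_second, chars_per_second):
    """Cost if output is billed per second of generated audio.

    tokens_per_second and chars_per_second (speaking rate) are assumptions.
    """
    seconds = chars / chars_per_second
    tokens = seconds * tokens_per_second
    return tokens / 1_000_000 * price_per_million_tokens


def cost_per_character(chars, price_per_million_chars):
    """Cost for a provider that charges by character."""
    return chars / 1_000_000 * price_per_million_chars


if __name__ == "__main__":
    chars = 1_000_000                # amount of text to synthesize
    token_price = 10.0               # placeholder: $ per 1M output tokens
    char_price = 100.0               # placeholder: $ per 1M characters

    text_like = cost_if_text_like(chars, token_price, chars_per_token=4.0)
    audio_like = cost_if_audio_tokens(chars, token_price,
                                      tokens_per_second=30.0,
                                      chars_per_second=15.0)
    per_char = cost_per_character(chars, char_price)

    print(f"token = text token:  ${text_like:.2f}")
    print(f"token = audio token: ${audio_like:.2f}")
    print(f"per-character:       ${per_char:.2f}")
```

With real prices plugged in, the gap between the first two results shows how much of the apparent price difference depends on which kind of token is being billed.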