I’m trying to calculate the cost of a single multimodal request to Gemini 2.0 Flash, because I need to bill my user for this amount.
The pricing documentation (Vertex AI Pricing | Generative AI on Vertex AI | Google Cloud) states that:
1M input tokens cost $0.15
1M input audio tokens cost $1.00
1M output text tokens cost $0.60
The SDK returns the following usage metadata:
prompt_token_count
candidates_token_count
total_token_count
My understanding is that if the input were only text, the cost would be:
cost = (0.15 / 1000000) * prompt_token_count + (0.6 / 1000000) * candidates_token_count
My question is:
Does the prompt_token_count include both the text input tokens and the audio input tokens? If so, how do I know how many tokens are due to text input vs audio input?
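To make the question concrete, here is a minimal sketch of the calculation I would like to do if the split were available. The prices are the ones quoted from the pricing page above. The parameters `text_input_tokens` and `audio_input_tokens` are hypothetical: they are not fields in the usage metadata listed above, and I am assuming that together they would add up to prompt_token_count.

```python
# Prices in USD per 1M tokens, as quoted from the pricing page above.
PRICE_PER_M_TEXT_INPUT = 0.15
PRICE_PER_M_AUDIO_INPUT = 1.00
PRICE_PER_M_TEXT_OUTPUT = 0.60
TOKENS_PER_M = 1_000_000


def request_cost(text_input_tokens: int,
                 audio_input_tokens: int,
                 output_tokens: int) -> float:
    """Return the cost in USD of one request.

    text_input_tokens and audio_input_tokens are hypothetical inputs:
    the assumed breakdown of prompt_token_count by modality.
    output_tokens corresponds to candidates_token_count.
    """
    text_cost = text_input_tokens * PRICE_PER_M_TEXT_INPUT / TOKENS_PER_M
    audio_cost = audio_input_tokens * PRICE_PER_M_AUDIO_INPUT / TOKENS_PER_M
    output_cost = output_tokens * PRICE_PER_M_TEXT_OUTPUT / TOKENS_PER_M
    return text_cost + audio_cost + output_cost


if __name__ == "__main__":
    # Made-up token counts, purely for illustration.
    print(f"{request_cost(1200, 800, 300):.6f}")
```

With `audio_input_tokens=0` this reduces to the text-only formula above, so the only missing piece is how to obtain the per-modality counts.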