Calculating cost for a single Gemini 2 request with audio and text

I’m trying to calculate the cost for a single multi-modal request for Gemini 2.0 Flash, because I need to charge this amount to my user.

The Vertex AI pricing documentation (Vertex AI Pricing | Generative AI on Vertex AI | Google Cloud) states that:

1M input tokens cost $0.15
1M input audio tokens cost $1.00
1M output text tokens cost $0.60

The SDK returns the following usage metadata:

prompt_token_count
candidates_token_count
total_token_count

My understanding is that if the input was only text, then the cost would be

cost = (0.15 / 1000000) * prompt_token_count + (0.6 / 1000000) * candidates_token_count

My question is:

Does the prompt_token_count include both the text input tokens and the audio input tokens? If so, how do I know how many tokens are due to text input vs audio input?

Assuming you’re using the Python SDK, you should be able to get the text and audio token counts separately by iterating over prompt_tokens_details:

    # Each entry reports the prompt token count for one input modality (e.g. text, audio).
    for prompt_tokens_detail in response.usage_metadata.prompt_tokens_details:
        print(f"Modality: {prompt_tokens_detail.modality.name} token count: {prompt_tokens_detail.token_count}")
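Once you have the per-modality counts, the cost formula from the question can be extended with one rate per input modality. Below is a minimal sketch, not an official billing calculation: the rates are the ones quoted in the question (check the pricing page for current values), the function and variable names are hypothetical, and it assumes the modality names come back as "TEXT" and "AUDIO". The token counts in the example are made up.

```python
from decimal import Decimal

# USD per 1M tokens, copied from the question. These are assumptions and
# may be out of date; verify them against the current pricing page.
INPUT_RATES_PER_MILLION = {
    "TEXT": Decimal("0.15"),
    "AUDIO": Decimal("1.00"),
}
OUTPUT_TEXT_RATE_PER_MILLION = Decimal("0.60")
MILLION = Decimal(1_000_000)


def estimate_request_cost(prompt_token_counts, candidates_token_count):
    """Estimate the cost in USD of one request.

    prompt_token_counts: dict mapping a modality name (assumed to be
    "TEXT" or "AUDIO") to its input token count.
    candidates_token_count: number of output text tokens.
    """
    cost = Decimal(0)
    for modality, tokens in prompt_token_counts.items():
        if modality not in INPUT_RATES_PER_MILLION:
            # Fail loudly rather than silently undercharge for an unknown modality.
            raise ValueError(f"No rate configured for modality {modality!r}")
        cost += INPUT_RATES_PER_MILLION[modality] * tokens / MILLION
    cost += OUTPUT_TEXT_RATE_PER_MILLION * candidates_token_count / MILLION
    return cost


# Hypothetical counts: 1,000 text + 2,000 audio input tokens, 500 output tokens.
example_cost = estimate_request_cost({"TEXT": 1000, "AUDIO": 2000}, 500)
print(example_cost)
```

With a real response, the dictionary could be built from the loop above, for example `{d.modality.name: d.token_count for d in response.usage_metadata.prompt_tokens_details}`. Decimal is used instead of float so that the small per-request amounts add up without rounding drift when you bill users.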