Gemini-3-flash-preview not returning prompt audio tokens in usage metadata when given a video file with an audio track

I’m trying to use gemini-3-flash-preview for analyzing video files, some of which have audio tracks.

I’ve noticed that when I pass it an mp4 with an audio track, the usage metadata prompt_tokens_details field does not contain an entry for the AUDIO modality, even though the response makes it clear that the model is using the audio. The prompt_token_count and total_token_count fields are consistent with the details that are returned (prompt_token_count equals the sum of the token_count fields in prompt_tokens_details).

The same script with the same video file but with the model swapped to gemini-2.5-flash reports AUDIO modality tokens in prompt_tokens_details.

Are we still charged for audio tokens that aren’t showing in the usage details? It’s important to us that the usage metadata in the response accurately reflects the billed tokens, so we can track our costs.

The usage metadata does contain audio tokens when passing a plain audio file (mp3).

Hi @Sahan_Reddy ,

Could you please share minimal reproducible code, along with the resulting output, that demonstrates this issue? Also, are you using a specific prompt for analyzing video files, and how are you uploading them (for example, the File API or GCS)? Knowing whether you are using Google AI Studio or Vertex AI would also help us diagnose the issue.

This is with the Gemini API, not Vertex AI. It’s using the Files API and passing the URI to the generate_content call.

Here’s a minimal reproduction script:

import os
import subprocess
import time
import json
from pathlib import Path

import google.genai as genai
import google.genai.types as genai_types

VIDEO_URL = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerBlazes.mp4" # video with audio
VIDEO_PATH = Path("./test_video.mp4")
if not VIDEO_PATH.exists():
    subprocess.run(["curl", "-L", "-o", str(VIDEO_PATH), VIDEO_URL], check=True)

client = genai.Client(
    api_key=os.getenv("GEMINI_API_KEY"),
    vertexai=False,
    http_options=genai_types.HttpOptions(timeout=60_000)
)

uploaded = client.files.upload(file=VIDEO_PATH)
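# Poll until processing finishes; stop instead of looping forever if it fails.
while uploaded.state != genai_types.FileState.ACTIVE:
    if uploaded.state == genai_types.FileState.FAILED:
        raise RuntimeError(f"File processing failed: {uploaded.name}")
    time.sleep(2)
    assert uploaded.name
    uploaded = client.files.get(name=uploaded.name)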
print(f"Uploaded to {uploaded.uri}")

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[{
        "role": "user",
        "parts": [
            {"file_data": {"file_uri": uploaded.uri, "mime_type": uploaded.mime_type}},
            {"text": "What is happening in this video?"},
        ],
    }]
)
print("Response:", response.text)
usage_str = json.dumps(response.usage_metadata.to_json_dict(), indent=2) if response.usage_metadata else "No usage metadata"
print("Usage:", usage_str)

And the output when I run that:

Uploaded to https://generativelanguage.googleapis.com/v1beta/files/92sbn4nxq3yq
Response: This video advertisement for Google Chromecast shows a person watching a TV show on their tablet and then casting the show to their TV using a Chromecast device. This provides a larger viewing experience and is shown to be easy and convenient.
Usage: {
  "candidates_token_count": 44,
  "prompt_token_count": 1343,
  "prompt_tokens_details": [
    {
      "modality": "VIDEO",
      "token_count": 1335
    },
    {
      "modality": "TEXT",
      "token_count": 8
    }
  ],
  "thoughts_token_count": 354,
  "total_token_count": 1741
}

Hi @Sahan_Reddy , Thanks for sharing the code snippet along with the output. Based on the usage metadata, total_token_count matches the sum of candidates_token_count + prompt_token_count + thoughts_token_count (44 + 1343 + 354 = 1741).
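For anyone who wants to repeat this arithmetic on their own responses, here is a minimal sketch that checks a usage dict for internal consistency and reports whether an AUDIO entry is present. The numbers are copied from the output above; the helper name check_usage is made up for illustration and is not part of the SDK.

```python
# Usage metadata copied from the output posted above.
usage = {
    "candidates_token_count": 44,
    "prompt_token_count": 1343,
    "prompt_tokens_details": [
        {"modality": "VIDEO", "token_count": 1335},
        {"modality": "TEXT", "token_count": 8},
    ],
    "thoughts_token_count": 354,
    "total_token_count": 1741,
}


def check_usage(usage: dict) -> dict:
    """Summarise per-modality prompt tokens and check that the totals add up.

    Hypothetical helper: it only inspects the dict, it does not call the API.
    """
    by_modality = {
        d["modality"]: d["token_count"]
        for d in usage.get("prompt_tokens_details", [])
    }
    # Do the per-modality counts add up to prompt_token_count?
    prompt_ok = sum(by_modality.values()) == usage["prompt_token_count"]
    # Does prompt + candidates + thoughts add up to total_token_count?
    total_ok = (
        usage["prompt_token_count"]
        + usage.get("candidates_token_count", 0)
        + usage.get("thoughts_token_count", 0)
    ) == usage["total_token_count"]
    return {
        "by_modality": by_modality,
        "prompt_ok": prompt_ok,
        "total_ok": total_ok,
        "has_audio": "AUDIO" in by_modality,
    }


result = check_usage(usage)
print(result)
```

Both sums hold for this response, so nothing is missing from the totals themselves; the only open question is whether the audio is counted inside the VIDEO entry.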

However, prompt_tokens_details lists only the VIDEO and TEXT modalities for your MP4. It is possible that the audio and video tokens are being bundled under the VIDEO modality. To verify whether you are being charged correctly for the audio inside your video files, could you try calling count_tokens on the same video file twice: once with the audio track and once with the audio track removed?
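A rough sketch of that comparison is below, in case it helps. It assumes ffmpeg is installed and on the PATH, reuses the Files API upload pattern from the script above, and reads only total_tokens from the count_tokens response. The function names and the output file name are made up for illustration.

```python
import os
import subprocess
import time
from pathlib import Path


def strip_audio_cmd(src: Path, dst: Path) -> list[str]:
    """Build an ffmpeg command that copies the streams as-is and drops audio (-an)."""
    return ["ffmpeg", "-y", "-i", str(src), "-c", "copy", "-an", str(dst)]


def count_video_tokens(client, path: Path, model: str) -> int:
    """Upload a file via the Files API and return total_tokens from count_tokens."""
    # Imported here so strip_audio_cmd can be used without the SDK installed.
    from google.genai import types

    uploaded = client.files.upload(file=path)
    # Wait for processing to finish, as in the reproduction script.
    while uploaded.state != types.FileState.ACTIVE:
        if uploaded.state == types.FileState.FAILED:
            raise RuntimeError(f"File processing failed for {path}")
        time.sleep(2)
        uploaded = client.files.get(name=uploaded.name)

    result = client.models.count_tokens(
        model=model,
        contents=[{
            "role": "user",
            "parts": [
                {"file_data": {"file_uri": uploaded.uri, "mime_type": uploaded.mime_type}},
            ],
        }],
    )
    return result.total_tokens


def main() -> None:
    from google import genai

    with_audio = Path("./test_video.mp4")
    no_audio = Path("./test_video_noaudio.mp4")  # hypothetical output path
    subprocess.run(strip_audio_cmd(with_audio, no_audio), check=True)

    client = genai.Client(api_key=os.getenv("GEMINI_API_KEY"))
    model = "gemini-3-flash-preview"
    a = count_video_tokens(client, with_audio, model)
    b = count_video_tokens(client, no_audio, model)
    print(f"with audio: {a}, without audio: {b}, difference: {a - b}")
```

Calling main() runs the comparison. If the two counts differ, the audio is likely being counted under VIDEO; if they are identical, that points toward audio tokens not being counted at all for this model. Either way, count_tokens is only an estimate of prompt size, so it does not by itself confirm what is billed.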