Gemini-3-flash-preview (Gemini API / v1beta) does not return usageMetadata.thoughtsTokenCount (and cache accounting seems off) even when thinking is enabled

Hi folks — I’m integrating the Gemini API via REST v1beta and I’m seeing what looks like a model/endpoint inconsistency on gemini-3-flash-preview around thinking usage accounting (and possibly context caching usage accounting).

Environment

  • API: Gemini API REST v1beta

  • Endpoint (streaming): …/models/{model}:streamGenerateContent?alt=sse

  • Model: gemini-3-flash-preview

  • Date: Jan 2026

  • Notes: I’m parsing SSE and capturing usageMetadata from the final chunks.

Issue 1 — Thinking enabled, but thoughtsTokenCount never appears

According to the docs:

  • Thinking tokens should be visible in usageMetadata.thoughtsTokenCount for thinking models

  • UsageMetadata includes thoughtsTokenCount as an output-only field.

However, with gemini-3-flash-preview, usageMetadata never includes thoughtsTokenCount even when I explicitly enable thinking.

Request (redacted)

{
  "generationConfig": {
    "thinkingConfig": {
      "includeThoughts": true,
      "thinkingBudget": 128
    }
  }
}

Observed usageMetadata (example)

{
  "promptTokenCount": 1783,
  "candidatesTokenCount": 63,
  "totalTokenCount": 1846
}

No thoughtsTokenCount, and totalTokenCount == promptTokenCount + candidatesTokenCount (suggesting 0 thinking tokens or missing reporting).
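As a sanity check on that arithmetic (values taken from the response above), the implied thinking-token count is total minus prompt minus candidates:

```python
# Usage metadata as observed in the gemini-3-flash-preview response above.
usage = {"promptTokenCount": 1783, "candidatesTokenCount": 63, "totalTokenCount": 1846}

# Any thinking tokens would have to live in the gap between the total and
# the prompt + candidate counts.
implied_thoughts = (
    usage["totalTokenCount"]
    - usage["promptTokenCount"]
    - usage["candidatesTokenCount"]
)
print(implied_thoughts)  # 0 -> either no thinking happened or it was not counted
```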

Control experiment

  • Switching the model to gemini-2.5-flash with the same thinkingConfig returns thoughtsTokenCount as expected.

Issue 2 — Cache hit / cache accounting on Flash 3 preview (optional detail)

I also suspect implicit caching may be inconsistent on gemini-3-flash-preview (e.g., cachedContentTokenCount not showing up, or prompt token counts not reflecting cache usage). Within the same consecutive conversation, and with an identical payload, a request sometimes hits the previous cache and sometimes does not (judging by cachedContentTokenCount); the hit rate is well below 100%.
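To quantify the flakiness rather than eyeball it, a small helper (hypothetical, the function name is mine) can compute the hit rate over a sequence of captured usageMetadata dicts:

```python
def cache_hit_rate(usages):
    """Fraction of responses whose usageMetadata reports a nonzero
    cachedContentTokenCount (i.e. an implicit cache hit)."""
    if not usages:
        return 0.0
    hits = sum(1 for u in usages if u.get("cachedContentTokenCount", 0) > 0)
    return hits / len(usages)

# Example: usage dicts captured from consecutive turns of one conversation
# (illustrative numbers, not real responses).
turns = [
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
    {"promptTokenCount": 1783},  # same payload, but no cache hit reported
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
]
print(cache_hit_rate(turns))  # roughly 0.67 for this sample
```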

Questions

  • Is this a known limitation/bug of the Flash 3 preview, or did I do something wrong?

Hey @interfish, welcome to the community!

I tried to verify the metadata with the simple script below:

import requests
import json
import os
from google.colab import userdata

API_KEY = userdata.get('api_key')
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"

def test_thinking_metadata(model_name, use_budget=True):
    print(f"\n--- Testing Model: {model_name} ---")
    
    url = f"{BASE_URL}/{model_name}:streamGenerateContent?key={API_KEY}&alt=sse"
    thinking_config = {"includeThoughts": True, "thinkingBudget": 128}
    payload = {
        "contents": [{"parts": [{"text": "Explain clearly why the sky is blue."}]}],
        "generationConfig": {
            "thinkingConfig": thinking_config
        }
    }
    
    print(f"Sending request with config: {json.dumps(thinking_config)}")
    
    try:
        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()
            
            final_usage = None
            has_thoughts = False
            
            for line in response.iter_lines():
                if line:
                    decoded_line = line.decode('utf-8')
                    if decoded_line.startswith("data:"):
                        try:
                            chunk = json.loads(decoded_line[5:])
                            
                            # Check for thought parts in candidates
                            if "candidates" in chunk and chunk["candidates"]:
                                parts = chunk["candidates"][0].get("content", {}).get("parts", [])
                                for part in parts:
                                    if "thought" in part and part["thought"]:
                                        has_thoughts = True
                  
                            if "usageMetadata" in chunk:
                                final_usage = chunk["usageMetadata"]
                                
                        except json.JSONDecodeError:
                            pass

            if final_usage:
                print("\n[Usage Metadata Received]:")
                print(json.dumps(final_usage, indent=2))
                
                if "thoughtsTokenCount" not in final_usage:
                    print(f"\n ISSUE REPLICATED: 'thoughtsTokenCount' is MISSING in {model_name}")
                else:
                    print(f"\n SUCCESS: 'thoughtsTokenCount' present: {final_usage['thoughtsTokenCount']}")
            else:
                print("\n Error: No usage metadata received.")

    except Exception as e:
        print(f"Request failed: {e}")


test_thinking_metadata("gemini-2.5-flash")

test_thinking_metadata("gemini-3-flash-preview")

test_thinking_metadata("gemini-3-pro-preview")

In all cases, I got thoughtsTokenCount back in the metadata.

Output:

--- Testing Model: gemini-3-flash-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}

[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 540,
  "totalTokenCount": 1039,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 491
}

✅ SUCCESS: 'thoughtsTokenCount' present: 491

--- Testing Model: gemini-3-pro-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}

[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 436,
  "totalTokenCount": 553,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 109
}

✅ SUCCESS: 'thoughtsTokenCount' present: 109

If you are using the official Google Gen AI SDK, ensure you are iterating the stream until the very end and checking the usage_metadata attribute of the final response object.
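For completeness, here is a sketch of that pattern with the google-genai Python SDK. The run_example function is illustrative only (it needs network access and a valid GEMINI_API_KEY, so it is not executed here); the essential part is draining the stream fully and reading usage_metadata from the last chunk that carries it:

```python
def final_usage(chunks):
    """Return the last non-empty usage_metadata seen while draining a stream."""
    usage = None
    for chunk in chunks:
        if getattr(chunk, "usage_metadata", None):
            usage = chunk.usage_metadata
    return usage


def run_example():
    # Not executed here: requires network access and a GEMINI_API_KEY env var.
    from google import genai
    from google.genai import types

    client = genai.Client()  # picks up GEMINI_API_KEY from the environment
    stream = client.models.generate_content_stream(
        model="gemini-3-flash-preview",
        contents="Explain clearly why the sky is blue.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(
                include_thoughts=True, thinking_budget=128
            )
        ),
    )
    usage = final_usage(stream)
    print(usage.thoughts_token_count if usage else "no usage metadata")
```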

Implicit caching on preview models is significantly more limited (see the Implicit caching documentation). Please try configuring explicit caching so that you have more control over cached tokens/contents.
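In case explicit caching helps, here is a minimal sketch of the two REST v1beta payload shapes involved (field names per the cachedContents API; the TTL, texts, and cache name are placeholders, and note that explicit caches reject inputs below a model-specific minimum token count):

```python
def build_cache_payload(model, system_text, ttl_seconds=300):
    """Body for POST /v1beta/cachedContents: pins system_text into an
    explicit cache on the given model for ttl_seconds."""
    return {
        "model": f"models/{model}",
        "systemInstruction": {"parts": [{"text": system_text}]},
        "ttl": f"{ttl_seconds}s",
    }


def build_generate_payload(cache_name, user_text):
    """Body for generateContent that reuses the cache by resource name
    (the create call returns something like 'cachedContents/abc123')."""
    return {
        "cachedContent": cache_name,
        "contents": [{"parts": [{"text": user_text}]}],
    }
```

Creating the cache once and passing its resource name on every turn makes the cached token accounting deterministic, instead of depending on the implicit cache deciding to fire.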

Thank you!