Hi folks — I’m integrating the Gemini API via REST v1beta and I’m seeing what looks like a model/endpoint inconsistency on gemini-3-flash-preview around thinking usage accounting (and possibly context caching usage accounting).
Environment
- API: Gemini API REST v1beta
- Endpoint (streaming): …/models/{model}:streamGenerateContent?alt=sse
- Model: gemini-3-flash-preview
- Date: Jan 2026
- Notes: I’m parsing SSE and capturing usageMetadata from the final chunks.
Issue 1 — Thinking enabled, but thoughtsTokenCount never appears
According to the docs, usageMetadata should include a thoughtsTokenCount when thinking is enabled. However, with gemini-3-flash-preview, usageMetadata never includes thoughtsTokenCount even when I explicitly enable thinking.
Request (redacted)
{
  "generationConfig": {
    "thinkingConfig": {
      "includeThoughts": true,
      "thinkingBudget": 128
    }
  }
}
Observed usageMetadata (example)
{
  "promptTokenCount": 1783,
  "candidatesTokenCount": 63,
  "totalTokenCount": 1846
}
No thoughtsTokenCount, and totalTokenCount == promptTokenCount + candidatesTokenCount (suggesting 0 thinking tokens or missing reporting).
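The inference above can be sketched as a quick check (the helper name is mine, not from any SDK):

```python
def implied_thinking_tokens(usage: dict) -> int:
    """Infer thinking tokens from usageMetadata.

    When thoughtsTokenCount is reported, trust it; otherwise fall back
    to the gap total - prompt - candidates, which should equal the
    thinking tokens if the totals are internally consistent.
    """
    if "thoughtsTokenCount" in usage:
        return usage["thoughtsTokenCount"]
    return (usage["totalTokenCount"]
            - usage["promptTokenCount"]
            - usage["candidatesTokenCount"])

# The usageMetadata observed above: the gap is 0, so either no thinking
# happened or the accounting dropped the thinking tokens entirely.
usage = {"promptTokenCount": 1783, "candidatesTokenCount": 63, "totalTokenCount": 1846}
print(implied_thinking_tokens(usage))  # → 0
```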
Control experiment
- Switching the model to gemini-2.5-flash with the same thinkingConfig returns thoughtsTokenCount as expected.
Issue 2 — Cache hit / cache accounting on Flash 3 preview (optional detail)
I also suspect implicit caching may be inconsistent on gemini-3-flash-preview (e.g., cachedContentTokenCount not showing up, or prompt token counts not reflecting cache usage). Watching cachedContentTokenCount across the same consecutive conversation, sometimes a previous cache is hit and sometimes not; the hit rate is not 100%. The payload is the same as above.
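The hit-rate observation can be made concrete with a small tally over the captured usageMetadata dicts (the sample data below is illustrative, not real API output):

```python
def cache_hit_rate(usages: list) -> float:
    """Fraction of responses whose usageMetadata reports any cached tokens."""
    hits = sum(1 for u in usages if u.get("cachedContentTokenCount", 0) > 0)
    return hits / len(usages)

# Three consecutive turns with an identical prompt prefix; the middle
# one reports no cached tokens at all (the behavior I'm describing).
turns = [
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
    {"promptTokenCount": 1783},
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
]
print(round(cache_hit_rate(turns), 2))  # → 0.67
```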
Questions
- Is this a known limitation/bug of Flash 3 preview, or did I do something wrong?
Hey @interfish, welcome to the community!
I tried to verify the metadata with the simple script below:
import requests
import json
from google.colab import userdata

API_KEY = userdata.get('api_key')
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"

def test_thinking_metadata(model_name):
    print(f"\n--- Testing Model: {model_name} ---")
    url = f"{BASE_URL}/{model_name}:streamGenerateContent?key={API_KEY}&alt=sse"
    thinking_config = {"includeThoughts": True, "thinkingBudget": 128}
    payload = {
        "contents": [{"parts": [{"text": "Explain clearly why the sky is blue."}]}],
        "generationConfig": {
            "thinkingConfig": thinking_config
        }
    }
    print(f"Sending request with config: {json.dumps(thinking_config)}")
    try:
        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()
            final_usage = None
            has_thoughts = False
            for line in response.iter_lines():
                if line:
                    decoded_line = line.decode('utf-8')
                    if decoded_line.startswith("data:"):
                        try:
                            chunk = json.loads(decoded_line[5:])
                            # Check for thought parts in candidates
                            if "candidates" in chunk and chunk["candidates"]:
                                parts = chunk["candidates"][0].get("content", {}).get("parts", [])
                                for part in parts:
                                    if "thought" in part and part["thought"]:
                                        has_thoughts = True
                            # The final chunks carry usageMetadata; keep the last one seen
                            if "usageMetadata" in chunk:
                                final_usage = chunk["usageMetadata"]
                        except json.JSONDecodeError:
                            pass
            if final_usage:
                print("\n[Usage Metadata Received]:")
                print(json.dumps(final_usage, indent=2))
                if "thoughtsTokenCount" not in final_usage:
                    print(f"\n ISSUE REPLICATED: 'thoughtsTokenCount' is MISSING in {model_name}")
                else:
                    print(f"\n SUCCESS: 'thoughtsTokenCount' present: {final_usage['thoughtsTokenCount']}")
            else:
                print("\n Error: No usage metadata received.")
    except Exception as e:
        print(f"Request failed: {e}")

test_thinking_metadata("gemini-2.5-flash")
test_thinking_metadata("gemini-3-flash-preview")
test_thinking_metadata("gemini-3-pro-preview")
In all cases, thoughtsTokenCount was present in the metadata.
Output:
--- Testing Model: gemini-3-flash-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}
[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 540,
  "totalTokenCount": 1039,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 491
}
✅ SUCCESS: 'thoughtsTokenCount' present: 491
--- Testing Model: gemini-3-pro-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}
[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 436,
  "totalTokenCount": 553,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 109
}
✅ SUCCESS: 'thoughtsTokenCount' present: 109
If you are using the official Google Gen AI SDK, ensure you are iterating the stream until the very end and checking the usage_metadata attribute of the final response object.
Implicit caching on preview models is significantly more limited than on stable models. Please try configuring explicit caching so that you have more control over cached tokens/contents.
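A minimal explicit-caching flow via REST looks roughly like the payloads below. I'm sketching the request shapes from the v1beta cachedContents resource as I recall them; the cache name and context text are placeholders, so double-check the field names against the current reference:

```python
import json

BASE = "https://generativelanguage.googleapis.com/v1beta"

# Step 1: POST {BASE}/cachedContents to create the cache entry up front.
create_cache_payload = {
    "model": "models/gemini-3-flash-preview",
    "contents": [{"role": "user", "parts": [{"text": "<large shared context>"}]}],
    "ttl": "300s",  # how long the cache entry stays alive
}

# Step 2: the create call returns a resource name like "cachedContents/abc123";
# reference it in generateContent so the shared prefix is billed as cached
# tokens and reported under cachedContentTokenCount.
generate_payload = {
    "cachedContent": "cachedContents/abc123",  # placeholder resource name
    "contents": [{"role": "user", "parts": [{"text": "A question about the cached context"}]}],
}

print(json.dumps(create_cache_payload, indent=2))
```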
Thank you!