Hi folks — I’m integrating the Gemini API via REST v1beta and I’m seeing what looks like a model/endpoint inconsistency on gemini-3-flash-preview around thinking usage accounting (and possibly context caching usage accounting).
Environment
- API: Gemini API REST v1beta
- Endpoint (streaming): …/models/{model}:streamGenerateContent?alt=sse
- Model: gemini-3-flash-preview
- Date: Jan 2026
- Notes: I’m parsing SSE and capturing usageMetadata from the final chunks.
Issue 1 — Thinking enabled, but thoughtsTokenCount never appears
According to the docs, usageMetadata should include a thoughtsTokenCount when thinking is enabled. However, with gemini-3-flash-preview, usageMetadata never includes thoughtsTokenCount even when I explicitly enable thinking.
Request (redacted)
{
  "generationConfig": {
    "thinkingConfig": {
      "includeThoughts": true,
      "thinkingBudget": 128
    }
  }
}
Observed usageMetadata (example)
{
  "promptTokenCount": 1783,
  "candidatesTokenCount": 63,
  "totalTokenCount": 1846
}
No thoughtsTokenCount, and totalTokenCount == promptTokenCount + candidatesTokenCount (suggesting 0 thinking tokens or missing reporting).
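The inference above can be sketched as a quick check (the helper name is mine, not from any SDK):

```python
def implied_thinking_tokens(usage: dict) -> int:
    """Infer thinking tokens from usageMetadata.

    When thoughtsTokenCount is reported, trust it; otherwise fall back
    to the gap total - prompt - candidates, which should equal the
    thinking tokens if the totals are internally consistent.
    """
    if "thoughtsTokenCount" in usage:
        return usage["thoughtsTokenCount"]
    return (usage["totalTokenCount"]
            - usage["promptTokenCount"]
            - usage["candidatesTokenCount"])

# The usageMetadata observed above: the gap is 0, so either no thinking
# happened or the accounting dropped the thinking tokens entirely.
usage = {"promptTokenCount": 1783, "candidatesTokenCount": 63, "totalTokenCount": 1846}
print(implied_thinking_tokens(usage))  # → 0
```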
Control experiment
- Switching the model to gemini-2.5-flash with the same thinkingConfig returns thoughtsTokenCount as expected.
Issue 2 — Cache hit / cache accounting on Flash 3 preview (optional detail)
I also suspect implicit caching may be inconsistent on gemini-3-flash-preview (e.g., cachedContentTokenCount not showing up, or prompt token counts not reflecting cache usage). Watching cachedContentTokenCount across the same consecutive conversation, sometimes a previous cache is hit and sometimes not; the hit rate is not 100%. The payload is the same as above.
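The hit-rate observation can be made concrete with a small tally over the captured usageMetadata dicts (the sample data below is illustrative, not real API output):

```python
def cache_hit_rate(usages: list) -> float:
    """Fraction of responses whose usageMetadata reports any cached tokens."""
    hits = sum(1 for u in usages if u.get("cachedContentTokenCount", 0) > 0)
    return hits / len(usages)

# Three consecutive turns with an identical prompt prefix; the middle
# one reports no cached tokens at all (the behavior I'm describing).
turns = [
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
    {"promptTokenCount": 1783},
    {"promptTokenCount": 1783, "cachedContentTokenCount": 1024},
]
print(round(cache_hit_rate(turns), 2))  # → 0.67
```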
Questions
- Is this a known limitation/bug of Flash 3 preview, or did I do something wrong?
Hey @interfish, welcome to the community!
I tried to verify the metadata with the simple script below:
import requests
import json
from google.colab import userdata

API_KEY = userdata.get('api_key')
BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"

def test_thinking_metadata(model_name):
    print(f"\n--- Testing Model: {model_name} ---")
    url = f"{BASE_URL}/{model_name}:streamGenerateContent?key={API_KEY}&alt=sse"
    thinking_config = {"includeThoughts": True, "thinkingBudget": 128}
    payload = {
        "contents": [{"parts": [{"text": "Explain clearly why the sky is blue."}]}],
        "generationConfig": {
            "thinkingConfig": thinking_config
        }
    }
    print(f"Sending request with config: {json.dumps(thinking_config)}")
    try:
        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()
            final_usage = None
            has_thoughts = False
            for line in response.iter_lines():
                if line:
                    decoded_line = line.decode('utf-8')
                    if decoded_line.startswith("data:"):
                        try:
                            chunk = json.loads(decoded_line[5:])
                            # Check for thought parts in candidates
                            if "candidates" in chunk and chunk["candidates"]:
                                parts = chunk["candidates"][0].get("content", {}).get("parts", [])
                                for part in parts:
                                    if "thought" in part and part["thought"]:
                                        has_thoughts = True
                            # The final chunks carry usageMetadata; keep the last one seen
                            if "usageMetadata" in chunk:
                                final_usage = chunk["usageMetadata"]
                        except json.JSONDecodeError:
                            pass
            if final_usage:
                print("\n[Usage Metadata Received]:")
                print(json.dumps(final_usage, indent=2))
                if "thoughtsTokenCount" not in final_usage:
                    print(f"\n ISSUE REPLICATED: 'thoughtsTokenCount' is MISSING in {model_name}")
                else:
                    print(f"\n SUCCESS: 'thoughtsTokenCount' present: {final_usage['thoughtsTokenCount']}")
            else:
                print("\n Error: No usage metadata received.")
    except Exception as e:
        print(f"Request failed: {e}")

test_thinking_metadata("gemini-2.5-flash")
test_thinking_metadata("gemini-3-flash-preview")
test_thinking_metadata("gemini-3-pro-preview")
In all cases, thoughtsTokenCount was present in the metadata.
Output:
--- Testing Model: gemini-3-flash-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}
[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 540,
  "totalTokenCount": 1039,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 491
}
✅ SUCCESS: 'thoughtsTokenCount' present: 491
--- Testing Model: gemini-3-pro-preview ---
Sending request with config: {"includeThoughts": true, "thinkingBudget": 128}
[Usage Metadata Received]:
{
  "promptTokenCount": 8,
  "candidatesTokenCount": 436,
  "totalTokenCount": 553,
  "promptTokensDetails": [
    {
      "modality": "TEXT",
      "tokenCount": 8
    }
  ],
  "thoughtsTokenCount": 109
}
✅ SUCCESS: 'thoughtsTokenCount' present: 109
If you are using the official Google Gen AI SDK, ensure you are iterating the stream until the very end and checking the usage_metadata attribute of the final response object.
Implicit caching on preview models is significantly more limited than on stable models. Please try configuring explicit caching so that you have more control over cached tokens/contents.
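A minimal explicit-caching flow via REST looks roughly like the payloads below. I'm sketching the request shapes from the v1beta cachedContents resource as I recall them; the cache name and context text are placeholders, so double-check the field names against the current reference:

```python
import json

BASE = "https://generativelanguage.googleapis.com/v1beta"

# Step 1: POST {BASE}/cachedContents to create the cache entry up front.
create_cache_payload = {
    "model": "models/gemini-3-flash-preview",
    "contents": [{"role": "user", "parts": [{"text": "<large shared context>"}]}],
    "ttl": "300s",  # how long the cache entry stays alive
}

# Step 2: the create call returns a resource name like "cachedContents/abc123";
# reference it in generateContent so the shared prefix is billed as cached
# tokens and reported under cachedContentTokenCount.
generate_payload = {
    "cachedContent": "cachedContents/abc123",  # placeholder resource name
    "contents": [{"role": "user", "parts": [{"text": "A question about the cached context"}]}],
}

print(json.dumps(create_cache_payload, indent=2))
```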
Thank you!