With gemini-2.5-flash-preview-09-2025, setting thinking_budget=0 has no impact: the model still includes thinking tokens in the output. For example, the returned usage_metadata reported:
thoughts_token_count=15226 despite the budget being set to 0.
This does not occur with Flash 05-20, which correctly returns 0 thinking tokens. The net result is that I'm being charged for thousands of extra tokens, plus extra latency, for tokens I explicitly requested not to receive.
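The comparison described above can be sketched as a small helper that flags the discrepancy; the function name and the 05-20 value of None are illustrative, but the 15226 figure is taken from the report:

```python
def thinking_tokens_leaked(thoughts_token_count, budget=0):
    """Return True if thinking tokens were billed despite a zero budget.

    usage_metadata.thoughts_token_count may be None when no thinking
    occurred, so treat None as 0 before comparing.
    """
    return budget == 0 and (thoughts_token_count or 0) > 0

# Values from the report:
print(thinking_tokens_leaked(15226))  # 09-2025 preview: budget ignored -> True
print(thinking_tokens_leaked(None))   # 05-20: budget respected -> False
```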
Thanks for sharing! I'm trying to reproduce the issue where thinking_budget=0 isn't respected by the gemini-2.5-flash-preview-09-2025 model. My tests with simpler prompts show the expected behavior: thoughts_token_count=None confirms the budget is being respected, as shown below.
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-09-2025",
    contents="Explain the concept of Occam's Razor and provide a simple, everyday example.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)  # Disables thinking
    ),
)
print(response.usage_metadata)
To help me debug this further, could you please share the exact prompt and any other relevant configuration you used? This will help me understand if specific prompt complexities are triggering the unexpected behavior.
Interestingly, removing response_mime_type="application/json" resolves the issue, and the model consistently reports 0 thinking tokens. But I need a JSON response, since I use structured outputs. gemini-2.5-flash-preview-05-20 does not exhibit this issue.
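For anyone trying to reproduce this, the triggering combination can be sketched as a raw generateContent request body: a zero thinking budget together with a JSON response MIME type and a schema. The prompt and schema below are placeholders, not the reporter's actual ones; only the request shape matters.

```python
import json

# Hypothetical request body for the Gemini API generateContent endpoint.
# The reported bug: with responseMimeType "application/json" present,
# thinkingBudget: 0 is ignored and thinking tokens are still billed.
body = {
    "contents": [{"parts": [{"text": "Summarize Occam's Razor in one sentence."}]}],
    "generationConfig": {
        "thinkingConfig": {"thinkingBudget": 0},
        "responseMimeType": "application/json",
        "responseSchema": {  # placeholder schema for illustration
            "type": "OBJECT",
            "properties": {"summary": {"type": "STRING"}},
            "required": ["summary"],
        },
    },
}
print(json.dumps(body, indent=2))
```

Dropping the responseMimeType and responseSchema keys from generationConfig is, per the report, enough to make the budget take effect again.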