We are having quite an annoying issue with Gemini 2.5 Flash thinking mode: it overshoots the thinking budget by A LOT, eating up almost the whole output length. It happens like this:
We set the thinkingBudget to a value like 1512 and the maxOutputTokens to 4000. Most of the time the thinking stays between ~1300 and ~2100 tokens, which is OK, doable. But in about 20% of the calls the thinking just explodes to 3000~3500 tokens, leaving almost no room for the output of the task itself. So we increased maxOutputTokens to 6000, and the same issue persists, now overshooting to 5000~5500.
If you run the same prompt, sometimes it happens and sometimes it doesn't, even with a very low temperature (0.02). I know the thinkingBudget is more of a “suggestion”, but a 3x overshoot is kind of annoying.
Is this a known issue?
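For reference, here is roughly how we make the call. This is a minimal sketch using the Python google-genai SDK for illustration; the prompt shown is a placeholder for our actual task prompt:

from google import genai
from google.genai import types

client = genai.Client()

# Parameter values are the ones described above; the prompt is a placeholder.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="<our task prompt>",
    config=types.GenerateContentConfig(
        temperature=0.02,
        max_output_tokens=4000,
        thinking_config=types.ThinkingConfig(thinking_budget=1512),
    ),
)
print(response.text)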
Hello,
Welcome to the Forum!!
Have you tried specifying the thinking budget using the following code block:
from google import genai
from google.genai import types

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Provide a list of 3 famous physicists and their key contributions",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
        # Turn off thinking:
        # thinking_config=types.ThinkingConfig(thinking_budget=0)
        # Turn on dynamic thinking:
        # thinking_config=types.ThinkingConfig(thinking_budget=-1)
    ),
)

print(response.text)
We tried recreating your issue but observed that the thinking token count remained below the specified limit.
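You can check the actual thought token usage from the response's usage metadata, as a quick sanity check (same SDK as in the snippet above):

# Inspect how many tokens went to thinking vs. the final answer.
usage = response.usage_metadata
print("thoughts:", usage.thoughts_token_count)
print("output:", usage.candidates_token_count)
print("total:", usage.total_token_count)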
For more detail, you can check the Gemini API documentation.
Hello Lalit!
Yes, we are using the thinkingBudget correctly. We noticed that if we set a value like 1024 it tends to overflow less, but when we set, for example, 1512~2048, it often explodes.
Hi,
We tried running Gemini 2.5 Flash with maxOutputTokens = 4000, thinking_budget = 1600, and temperature = 0.8. We ran it 10 times and observed that the thought token count remained within the specified limit.
To reproduce your issue, would you be able to share your code and prompt with us?
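For reference, this is the kind of repro loop we used. It is a sketch assuming the Python google-genai SDK, with the parameters stated above; swap in your own prompt:

from google import genai
from google.genai import types

client = genai.Client()

# Repeat the call and record the thought token count each time.
for i in range(10):
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="<your prompt here>",
        config=types.GenerateContentConfig(
            temperature=0.8,
            max_output_tokens=4000,
            thinking_config=types.ThinkingConfig(thinking_budget=1600),
        ),
    )
    print(i, response.usage_metadata.thoughts_token_count)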
I have noticed the same behaviour. This issue has been raised with evidence on the ADK GitHub issues list (Thinking config for 2.5 models · Issue #1018 · google/adk-python · GitHub).