Hi!
Sometimes, when the model is set to a low thinking_level, or thinking_budget is set to 128 or 256, it unexpectedly uses around 3,000 thought tokens, even though the task is almost identical to others. This happens with both temperature=1.0 and temperature=0.0, and it significantly affects API costs. I’d really appreciate it if this could be fixed quickly. Thank you.
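Roughly, the setup looks like this (a minimal sketch with the @google/genai TypeScript SDK for illustration; the Python SDK uses the equivalent snake_case fields, and the model id, prompt, and budget value here are placeholders rather than my exact setup):

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const response = await ai.models.generateContent({
  model: "gemini-3-pro-preview", // assumed model id
  contents: "Summarize this paragraph in one sentence: …", // placeholder prompt
  config: {
    thinkingConfig: {
      thinkingBudget: 128, // I also tried thinkingLevel: "low" on Gemini 3
    },
  },
});

console.log(response.text);
// Usually around 100 here, but occasionally it jumps to ~3,000:
console.log(response.usageMetadata?.thoughtsTokenCount);
```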
For now, I solved this by adding the system prompt “Please don’t think too much!”
I experimented with temperature, top_p, and raising the thinking budget to a fairly large 512 tokens, but none of it worked.
And even that workaround fails often. I really need some help here.
Hi @komin, thank you for bringing this to our attention. Could you please provide an example of the prompt you are using with Gemini 3? This will help us diagnose the issue.
I’ve sent a DM including my prompt. Please check it. Thanks.
I have the same issue. `config.thinkingConfig.thinkingLevel = "low"` works fine only for shorter contexts. When given a context of 5,000 or more tokens, thinking also starts growing uncontrollably, reaching thousands of tokens.
When iterating over chunks from `generateContentStream`, the first 3-4 chunks often contain thoughts only. The thoughts in each chunk are of acceptable length, but Gemini keeps elaborating on its thought headers across multiple chunks in a row before it starts generating the answer.
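Here is roughly how I observe it (a minimal sketch with the @google/genai SDK; the model id, the includeThoughts flag, and the dialogue placeholder are illustration-only assumptions, not my exact code):

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Placeholder for the ~5000-token dialogue context.
const dialogueContext = "…long dialogue between the two characters…";

const stream = await ai.models.generateContentStream({
  model: "gemini-3-pro-preview", // assumed model id
  contents: dialogueContext,
  config: {
    thinkingConfig: { thinkingLevel: "low", includeThoughts: true },
  },
});

for await (const chunk of stream) {
  for (const part of chunk.candidates?.[0]?.content?.parts ?? []) {
    if (part.thought) {
      // The first 3-4 chunks often contain only these thought parts.
      console.log("THOUGHT:", part.text);
    } else if (part.text) {
      process.stdout.write(part.text);
    }
  }
  // usageMetadata (with thoughtsTokenCount) shows up on the final chunk.
  if (chunk.usageMetadata?.thoughtsTokenCount !== undefined) {
    console.log("\nthoughtsTokenCount:", chunk.usageMetadata.thoughtsTokenCount);
  }
}
```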
I tried a system prompt with `IMPORTANT: As a large language model, do not think, generate the final response immediately. You have already used too many thought tokens and will be heavily punished for exceeding the quota.` but it did not help at all.
For example, when asked to continue a dialogue between two characters and write the response for a single character, Gemini 3 Pro returned chunks with the following thought headers, reaching thoughtsTokenCount = 1741:
**Exploring Anton's Dilemma**
**Unpacking Anton's Pressure**
**Perfecting Anton's Words**
**Crafting Anton's Ending**
**Focusing Anton's Final Words**
**Continuing Anton's Speech**
**Perfecting Anton's Ending**
The only thing that worked (but it disabled thinking completely) was to send the context with a fake last message containing
<thought></thought>
or
<think></think>
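In case it helps anyone reproduce this, the fake last message looks roughly like this (a sketch; using a model-role turn for the fake message is my assumption here, and as noted it suppresses thinking entirely rather than capping it):

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Existing conversation turns (placeholder content).
const dialogueTurns = [
  { role: "user", parts: [{ text: "…dialogue context…" }] },
];

// Workaround: append a fake last message holding an empty think block so the
// model behaves as if thinking is already finished. Using role "model" here
// (a prefill-style turn) is an assumption; adjust to your own convention.
const response = await ai.models.generateContent({
  model: "gemini-3-pro-preview", // assumed model id
  contents: [
    ...dialogueTurns,
    { role: "model", parts: [{ text: "<think></think>" }] },
  ],
});

console.log(response.text);
```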
Thanks for sharing your experience. I’ll try your suggestion! My prompt is about 3,000 tokens. The problem is that most of the time it thinks for about 100 tokens, but roughly once every 50 prompts it goes wild. An interesting point is that when I forcibly decrease the thinking tokens, the accuracy of the multimodal capabilities decreases too. Maybe it’s about the model? Anyway, thanks a lot!
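In case it helps with reproduction, this is roughly how I catch the spikes (a sketch; the prompt text, model id, and loop count are placeholders):

```ts
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Re-run the same ~3000-token prompt and log thoughtsTokenCount each time,
// to catch the roughly 1-in-50 runs where it jumps from ~100 to ~3,000.
const prompt = "…my ~3000-token prompt…"; // placeholder

for (let i = 0; i < 50; i++) {
  const response = await ai.models.generateContent({
    model: "gemini-3-pro-preview", // assumed model id
    contents: prompt,
    config: { thinkingConfig: { thinkingLevel: "low" } },
  });
  console.log(i, response.usageMetadata?.thoughtsTokenCount);
}
```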
Hi @komin ,
Apologies for the delayed response, and thank you for sharing the details. Using clear system instructions and keeping prompts concise usually helps maintain consistent and cost-efficient token usage, so could you please try that? Please let me know if the issue persists.
Thanks.