Latest @google/genai with 2.5 Flash ignoring thinking budget

I am asking Gemini 2.5 Flash, via the latest @google/genai JS SDK, to extract information from a PNG image. I call generateContent and pass a config like this:

config: {
  systemInstruction: `look at the picture and extract the parts`,
  temperature: 0,
  thinkingConfig: {
    thinkingBudget: 4096,
  },
  maxOutputTokens: 8096
}

I also pass a JSON response schema. The problem is that the model seems to ignore the thinking budget and has "runaway thoughts": if I increase the max output tokens, it keeps thinking until it hits that limit.

usageMetadata: {
  promptTokenCount: 1886,
  totalTokenCount: 9981,
  trafficType: 'ON_DEMAND',
  promptTokensDetails: [ [Object], [Object] ],
  thoughtsTokenCount: 8095
}
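
For context, here is roughly what the full call looks like. This is a minimal sketch rather than my exact code: the image file name, prompt text, and responseSchema shape are illustrative stand-ins.

import { GoogleGenAI, Type } from "@google/genai";
import { readFileSync } from "node:fs";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Illustrative: load the PNG and pass it as base64 inline data.
const base64Png = readFileSync("parts.png").toString("base64");

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: [
    {
      role: "user",
      parts: [
        { inlineData: { mimeType: "image/png", data: base64Png } },
        { text: "Extract the parts from this picture." },
      ],
    },
  ],
  config: {
    systemInstruction: `look at the picture and extract the parts`,
    temperature: 0,
    thinkingConfig: {
      thinkingBudget: 4096,
    },
    maxOutputTokens: 8096,
    // Removing the two response* lines below makes the budget behave (see the EDIT).
    responseMimeType: "application/json",
    responseSchema: {
      type: Type.OBJECT,
      properties: {
        parts: { type: Type.ARRAY, items: { type: Type.STRING } },
      },
    },
  },
});

console.log(response.usageMetadata);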

EDIT:
I just removed the JSON schema from the call, and it now seems to respect the thinking budget, so that appears to be the problem. Also, if I reduce the thinkingBudget to something like 1024, it respects the budget more often.


This is a pretty bad bug that makes Gemini nearly unusable for any complex thinking task where you need a structured output format. When will this be fixed?


Can someone from Google please look at this?

Really shocked Google hasn't even responded to or acknowledged this issue. It makes thinking unusable with JSON schemas.


Exactly the same thing I'm experiencing.

Hi @Justine_Chang,

Can you share a scenario where this behavior is observed?

That would help us reproduce and investigate the issue.

I shared the steps above: you set a thinking budget together with a JSON response schema. My config has a thinkingConfig with a budget of 4096. If I ask Gemini to extract the parts from an image, it runs over the budget until it runs out of tokens. If I remove the JSON schema, it respects the limit.

I would say the instructions from cor are correct.
It doesn't happen every time, maybe 1 in 50 requests, but I'm processing thousands and thousands of requests, so that's how it shows up.

I think if you have a script that makes a request to Gemini 2.5 Flash and asks for structured output, and you loop it 1000 times, you should hit it at least once. A rough sketch of such a loop is below.
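
The prompt and schema in this sketch are just placeholders; it simply records how often thoughtsTokenCount overruns the configured budget.

import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const BUDGET = 4096;

let overruns = 0;
for (let i = 0; i < 1000; i++) {
  const res = await ai.models.generateContent({
    model: "gemini-2.5-flash",
    contents: "List three parts of a bicycle.", // placeholder prompt
    config: {
      thinkingConfig: { thinkingBudget: BUDGET },
      maxOutputTokens: 8096,
      responseMimeType: "application/json",
      responseSchema: { type: Type.ARRAY, items: { type: Type.STRING } },
    },
  });
  const thoughts = res.usageMetadata?.thoughtsTokenCount ?? 0;
  if (thoughts > BUDGET) {
    overruns += 1;
    console.log(`run ${i}: thoughtsTokenCount=${thoughts} exceeded budget ${BUDGET}`);
  }
}
console.log(`${overruns} overruns out of 1000 runs`);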

I'm experiencing the same issue, but only with the latest gemini-2.5-flash-preview-09-2025 model. The code in my post reproduces it every time. It only happens when you request JSON output.

Hi @all,

Thanks for flagging this issue.

A fix for this issue was rolled out earlier this week.

I just tested using the gemini Flash-Lite-Latest model and it's working fine. Please check whether you are still facing this issue.

I just tested thinking_budget=0 on gemini-2.5-flash-preview-09-2025 and it continues to output thinking tokens, as discussed here.
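
A minimal check along these lines shows the behavior; the prompt and schema here are placeholders, not my actual request.

import { GoogleGenAI, Type } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

const res = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-09-2025",
  contents: "Return the numbers 1 to 3 as a JSON array.", // placeholder prompt
  config: {
    thinkingConfig: { thinkingBudget: 0 }, // thinking should be disabled entirely
    responseMimeType: "application/json",
    responseSchema: { type: Type.ARRAY, items: { type: Type.INTEGER } },
  },
});

// Expected to be 0 or absent; the bug is that thought tokens are still reported.
console.log(res.usageMetadata?.thoughtsTokenCount);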