Hi folks,
Thank you for your patience while we investigated the recent cost surge reported on the Gemini 3 Flash model. We have completed our root cause analysis, and we’ve identified that the increase is due to a recently fixed billing bug where we were previously undercharging for certain usage. What you are seeing now is the correct, intended billing amount.
To clarify exactly what changed, here is the nuance behind the behavior:
- By default, if you don't explicitly specify a thinking level in your request, the Gemini 3 Flash model defaults to HIGH (dynamic) thinking. This is intended behavior: it has been in place since we launched the model and will remain so.
- However, there was previously a bug where users who relied on this default behavior (leaving the thinking configuration blank) were not billed for the thinking tokens those requests generated. On March 16th, we deployed a fix to our billing system. As of that date, default requests are correctly billed for the thinking tokens they generate.
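To make the billing change concrete, here is a small sketch of how thinking tokens factor into a request's cost. The per-token prices and token counts below are hypothetical, purely for illustration; check the official pricing page for real rates.

```python
def estimate_cost(prompt_tokens: int, output_tokens: int, thinking_tokens: int,
                  input_price: float, output_price: float) -> float:
    """Estimate request cost in USD. Thinking tokens are billed at the
    output-token rate; before the March 16th fix, default requests were
    effectively billed as if the thinking term were zero."""
    return (prompt_tokens * input_price
            + (output_tokens + thinking_tokens) * output_price)

# Hypothetical prices (USD per token) and counts, for illustration only.
before_fix = estimate_cost(1_000, 500, 0, 1e-7, 4e-7)      # thinking not billed
after_fix = estimate_cost(1_000, 500, 2_000, 1e-7, 4e-7)   # thinking now billed

print(f"before: ${before_fix:.4f}, after: ${after_fix:.4f}")
```

The only thing that changed on March 16th is that the `thinking_tokens` term is no longer dropped; the rates themselves did not change.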
We will not retroactively charge for the tokens that were served for free, so you don't have to worry about back-charges. We are also not planning to offer refunds, since the bug caused undercharging rather than overcharging.
Your options going forward:
If minimizing cost is a strict priority for you, you can adjust the model's thinking behavior.
I'd also like to offer a small correction to previous guidance from Nabila. First, I recommend caution before minimizing thinking: Gemini models are trained for thinking and deliver better results with it enabled, so depending on your specific use case, minimizing thinking may lower performance and response quality. Experiment with different settings (e.g., low or medium thinking levels), and run a small evaluation on your end before pushing any change to production. You can also inspect your thinking token usage to get a better sense of the cost-to-performance tradeoff for your specific configuration.
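As a starting point for that inspection, here is a sketch of computing what share of your completion tokens are thinking tokens. The field names follow my understanding of the API's usage metadata (`thoughts_token_count` / `candidates_token_count` in the Python SDK's snake_case form); verify them against the current docs, and note the stub below stands in for a real response object.

```python
from types import SimpleNamespace

def thinking_share(usage) -> float:
    """Fraction of billed completion tokens that were thinking tokens.
    `usage` mirrors the shape of a response's usage metadata."""
    thinking = usage.thoughts_token_count or 0
    visible = usage.candidates_token_count or 0
    total = thinking + visible
    return thinking / total if total else 0.0

# With the google-genai SDK you would pass response.usage_metadata here;
# a stub stands in for a live API response in this sketch.
usage = SimpleNamespace(thoughts_token_count=1500, candidates_token_count=500)
print(f"{thinking_share(usage):.0%} of completion tokens were thinking tokens")
```

Tracking this ratio across your real traffic tells you how much of the cost change comes from thinking tokens specifically, and whether a lower thinking level is worth evaluating.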
Second, if you still want to minimize thinking behavior, you need to explicitly set the thinking level to MINIMAL (it's a level, not a boolean flag). The linked documentation walks you through this.
I also need to note that you can't disable thinking entirely; the model is trained to think, so there's no way to turn it off. Setting the level to MINIMAL limits thinking as much as possible: in most cases the model won't think at all, but for some complex tasks like coding, it may still generate some thinking tokens.
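For reference, here is roughly what the request looks like with the thinking level set explicitly. This is a sketch of a REST-style request body; the exact field names (`generationConfig.thinkingConfig.thinkingLevel`) are my best understanding, so treat the linked documentation as authoritative.

```python
# Hypothetical REST request body for generateContent; verify field names
# against the current API docs before relying on this shape.
request_body = {
    "contents": [{"parts": [{"text": "Summarize this ticket."}]}],
    "generationConfig": {
        # A level, not a boolean flag: MINIMAL rather than "thinking: false".
        "thinkingConfig": {"thinkingLevel": "MINIMAL"},
    },
}
```

Leaving `thinkingConfig` out entirely gives you the HIGH/dynamic default described above, which is now billed for the thinking tokens it generates.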
Thank you again for raising this. Feel free to reach out here or to me directly with questions and feedback; I do my best to read and respond to all of them.
Thanks,
Ali - Product Lead for Gemini API