I’m filing this as a billing and API correctness issue. I have two days of logged `usageMetadata` from a batch job running `gemini-flash-latest` (Gemini 3.5 Flash), with `thinkingConfig: { thinkingLevel: 'low' }` set explicitly partway through, once I discovered the problem. The data shows sporadic but severe thinking-token runaway that, in aggregate, is producing bills inconsistent with the task complexity.
## Setup
- SDK: `@google/genai` (new SDK)
- Model: `gemini-flash-latest` (Gemini 3.5 Flash)
- `thinkingConfig: { thinkingLevel: 'low' }` set in every call
- Two call types: `googleSearch`-grounded calls (URL discovery, ~42-55 input tokens; and company classification, ~1,960 input tokens) and plain JSON extraction calls (~2k input tokens)
## Aggregate stats - Jun 3 and 4, 2026 (391 calls)
| Metric | Value |
|---|---|
| Total calls | 391 |
| Total input tokens | 741,608 |
| Total output (answer) tokens | 50,748 |
| Total thought tokens | 2,694,099 |
| Total billed tokens | 3,486,455 |
| Thought/input ratio | **3.6x** |
| Calls with >10k thought tokens | 40 (10% of calls) |
| Thought tokens from those 40 calls | 2,414,368 (90% of all thought costs) |
| Calls with >60k thought tokens | 36 |

The problem is not uniformly high thinking - it is a **bimodal distribution**: the majority of calls stay in the 700-2,000 thought-token range, as expected for `low`, but roughly 10% of calls unpredictably spike to 60k-65k thought tokens, on inputs that are no more complex than those of the well-behaved calls.
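For anyone who wants to check the table against their own logs, here is a minimal sketch of how these aggregates can be recomputed. It assumes each log record is a `usageMetadata` object with the field names shown in this post; `summarizeUsage` and the 10,000-token spike cut-off are my own choices, not part of the SDK. It also assumes the billed total is input + output + thought tokens, which matches the totals in my logs.

```js
// Each record is one logged usageMetadata object.
// spikeThreshold is an arbitrary cut-off chosen for this analysis, not an API constant.
function summarizeUsage(records, spikeThreshold = 10000) {
  // Missing fields are treated as 0.
  const sum = (key) => records.reduce((acc, r) => acc + (r[key] ?? 0), 0);

  const input = sum('promptTokenCount');
  const output = sum('candidatesTokenCount');
  const thoughts = sum('thoughtsTokenCount');

  // Calls whose thought tokens exceed the cut-off.
  const spikes = records.filter((r) => (r.thoughtsTokenCount ?? 0) > spikeThreshold);
  const spikeThoughts = spikes.reduce((acc, r) => acc + r.thoughtsTokenCount, 0);

  return {
    calls: records.length,
    input,
    output,
    thoughts,
    total: input + output + thoughts,
    thoughtToInputRatio: input > 0 ? thoughts / input : NaN,
    spikeCalls: spikes.length,
    spikeShareOfCalls: records.length > 0 ? spikes.length / records.length : NaN,
    spikeShareOfThoughts: thoughts > 0 ? spikeThoughts / thoughts : NaN,
  };
}
```

Calling `summarizeUsage(records)` on the full two-day log gives the totals and ratios in the table; passing `60000` as the second argument counts only the calls near the ceiling.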
## Smoking-gun example - URL discovery call
```js
// Prompt (42 tokens):
const prompt =
  'On the website http//www.angelssante.fr/, what is the URL of the page listing ' +
  'the portfolio companies (investments) of “Angel Sante”? Reply with only the single URL.';
// Config:
const config = { tools: [{ googleSearch: {} }], temperature: 0.1, thinkingConfig: { thinkingLevel: 'low' } };
```
Observed `usageMetadata`:
```
promptTokenCount: 42
candidatesTokenCount: 0 ← no answer produced
thoughtsTokenCount: 63,853
totalTokenCount: 63,895
```
The model burned 63,853 thought tokens - **1,520x the input size** - and produced zero output.
This is not a one-off: 36 calls in this two-day window hit the 60k-65k ceiling, several also with `candidatesTokenCount: 0`.
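To pull these calls out of the logs, I flag each record with a check along the lines of the sketch below. The function name `classifyCall` and the 60,000-token ceiling are my own choices, based on where the spikes cluster in my data; they are not documented limits. A missing `candidatesTokenCount` is treated as zero, on the assumption that a call with no answer may omit the field.

```js
// Classify one logged usageMetadata record.
// ceiling is an empirical cut-off taken from my logs, not a documented API limit.
function classifyCall(usage, ceiling = 60000) {
  const thoughts = usage.thoughtsTokenCount ?? 0;
  const answer = usage.candidatesTokenCount ?? 0;

  if (thoughts >= ceiling && answer === 0) return 'runaway-no-output';
  if (thoughts >= ceiling) return 'runaway';
  return 'normal';
}

// The URL discovery call from the example in this post.
const example = {
  promptTokenCount: 42,
  candidatesTokenCount: 0,
  thoughtsTokenCount: 63853,
  totalTokenCount: 63895,
};
console.log(classifyCall(example)); // prints "runaway-no-output"
```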
## Billing impact
The Gemini API billing dashboard shows SGD 435.56 consumed against a SGD 600 monthly cap - with only 4 days elapsed in June.
The job running these calls was the primary workload during those high-spend days. Given that 77% of billed tokens in the logged window were thinking tokens (2.69M of 3.49M total), and the task complexity does not justify that ratio, I believe a substantial portion of the total bill reflects the same runaway-thinking pattern across earlier runs, where I was not yet capturing `usageMetadata`.
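For others who are not yet capturing `usageMetadata`, this is roughly the kind of wrapper I would suggest. It is a sketch under assumptions: `generateWithUsageLog` and `sink` are hypothetical names, and `generate` is any async function that takes a request and resolves to a response carrying `usageMetadata` (with `@google/genai` I would expect that to be `(req) => ai.models.generateContent(req)`).

```js
// Wrap a generate call so that token usage is recorded for every request.
// generate: async (request) => response with an optional usageMetadata field.
// sink: hypothetical log destination, e.g. (record) => rows.push(record).
async function generateWithUsageLog(generate, request, sink) {
  const startedAt = new Date().toISOString();
  const response = await generate(request);
  const usage = response.usageMetadata ?? {};

  // Missing counters are logged as 0 so that later aggregation stays simple.
  sink({
    startedAt,
    model: request.model,
    promptTokenCount: usage.promptTokenCount ?? 0,
    candidatesTokenCount: usage.candidatesTokenCount ?? 0,
    thoughtsTokenCount: usage.thoughtsTokenCount ?? 0,
    totalTokenCount: usage.totalTokenCount ?? 0,
  });

  return response;
}
```

Keeping the SDK call behind the `generate` parameter means the logging can be exercised with a stub and reused if the workload moves to a different model or client.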
## What I have already done
- Migrated this workload to `gemini-3.1-flash-lite` with `thinkingLevel: 'minimal'` for grounded calls - the calls now behave correctly.
- The fix confirms the issue was model-specific to 3.5 Flash: identical prompts on `gemini-3.1-flash-lite` do not exhibit the spike pattern.
## Questions for Google
1. Is `thinkingLevel: 'low'` respected when the `googleSearch` tool is active, or does the search path bypass the thinking budget?
2. Is `candidatesTokenCount: 0` with `thoughtsTokenCount: ~64k` a known failure mode? It appears the model hits an internal limit mid-reasoning and aborts without producing output - but the thinking tokens are still billed.
3. Is there a billing review path for the spike calls? The `usageMetadata` logs provide exact timestamps and token counts for every anomalous call. Please DM me and I can share the relevant details privately.
The pattern - 10% of calls at `thinkingLevel: 'low'` consuming 90% of thought-token costs, with the worst cases producing zero output - does not appear to be the intended behaviour of the `low` setting.
Thank you for your attention to this matter!