thinkingLevel: 'low' producing unpredictable 60k+ thought-token spikes on trivial prompts - systemic billing impact

I'm filing this as a billing and API correctness issue. I have two days of logged `usageMetadata` from a batch job running `gemini-flash-latest` (Gemini 3.5 Flash) with `thinkingConfig: { thinkingLevel: 'low' }` set explicitly (I added it partway through, once I noticed the problem). The data shows sporadic but severe thinking-token runaway that, in aggregate, produces bills inconsistent with the task complexity.

## Setup

- SDK: `@google/genai` (new SDK)

- Model: `gemini-flash-latest` (Gemini 3.5 Flash)

- `thinkingConfig: { thinkingLevel: 'low' }` set in every call

- Two call types: `googleSearch`-grounded calls (URL discovery, ~42-55 input tokens; and company classification, ~1,960 input tokens) and plain JSON extraction calls (~2k input tokens)
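
For reference, the request shape for the two call types can be sketched as follows. `buildRequest` is a hypothetical helper name used only for illustration; the object it returns follows what `ai.models.generateContent` in `@google/genai` accepts as far as I understand it, and the SDK call itself is shown only in comments because it needs an API key.

```js
// Hypothetical helper (not part of @google/genai): builds the request object
// for one call. Grounded calls add the googleSearch tool; plain JSON
// extraction calls send no tools.
function buildRequest(prompt, { grounded = false, thinkingLevel = 'low' } = {}) {
  const config = { thinkingConfig: { thinkingLevel } };
  if (grounded) {
    config.tools = [{ googleSearch: {} }];
  }
  return { model: 'gemini-flash-latest', contents: prompt, config };
}

// Assumed usage with the SDK (not executed here):
//   import { GoogleGenAI } from '@google/genai';
//   const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
//   const response = await ai.models.generateContent(
//     buildRequest(prompt, { grounded: true })
//   );
//   console.log(response.usageMetadata);
```

Logging `response.usageMetadata` after every call is what made the spike pattern visible in the first place.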

## Aggregate stats - Jun 3 and 4, 2026 (391 calls)

| Metric | Value |
|---|---|
| Total calls | 391 |
| Total input tokens | 741,608 |
| Total output (answer) tokens | 50,748 |
| Total thought tokens | 2,694,099 |
| Total billed tokens | 3,486,455 |
| Thought/input ratio | **3.6x** |
| Calls with >10k thought tokens | 40 (10% of calls) |
| Thought tokens from those 40 calls | 2,414,368 (90% of all thought costs) |
| Calls with >60k thought tokens | 36 |

The problem is not uniformly high thinking - it is a **bimodal distribution**: the majority of calls stay in the 700-2,000 thought-token range as expected for `low`, but roughly 10% of calls spike to 60k-65k thoughts unpredictably, on inputs that are no more complex than the well-behaved calls.
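
A minimal sketch of how these aggregates can be derived from logged `usageMetadata` records. `summarizeUsage` and the 10,000-token spike threshold are illustrative choices rather than anything defined by the API, and "billed" here simply means input + output + thought tokens.

```js
// Summarise an array of logged usageMetadata objects and measure how much of
// the thought-token total comes from spike calls (thoughts > spikeThreshold).
function summarizeUsage(records, spikeThreshold = 10000) {
  let input = 0;
  let output = 0;
  let thoughts = 0;
  let spikeCalls = 0;
  let spikeThoughts = 0;

  for (const usage of records) {
    // Missing fields are treated as 0 so partial log entries do not break the sum.
    const t = usage.thoughtsTokenCount ?? 0;
    input += usage.promptTokenCount ?? 0;
    output += usage.candidatesTokenCount ?? 0;
    thoughts += t;
    if (t > spikeThreshold) {
      spikeCalls += 1;
      spikeThoughts += t;
    }
  }

  return {
    calls: records.length,
    input,
    output,
    thoughts,
    billed: input + output + thoughts,
    thoughtToInputRatio: input > 0 ? thoughts / input : 0,
    spikeCalls,
    spikeShareOfThoughts: thoughts > 0 ? spikeThoughts / thoughts : 0,
  };
}
```

Running this over a full log and comparing `spikeCalls / calls` with `spikeShareOfThoughts` shows the bimodal shape directly: a small fraction of calls accounts for most of the thought tokens.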

## Smoking-gun example - URL discovery call

```js
// Prompt (42 tokens):
const prompt =
  'On the website http//www.angelssante.fr/, what is the URL of the page listing ' +
  'the portfolio companies (investments) of “Angel Sante”? Reply with only the single URL.';

// Config:
const config = {
  tools: [{ googleSearch: {} }],
  temperature: 0.1,
  thinkingConfig: { thinkingLevel: 'low' },
};
```

Observed `usageMetadata`:

```
promptTokenCount:     42
candidatesTokenCount: 0        ← no answer produced
thoughtsTokenCount:   63,853
totalTokenCount:      63,895
```

The model burned 63,853 thought tokens - **1,520x the input size** - and produced zero output.

This is not a one-off: 36 calls in this two-day window hit the 60k-65k ceiling, several also with `candidatesTokenCount: 0`.
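
A minimal sketch of the check that separates these calls from normal ones in the logs. `classifyUsage`, its labels, and the 60,000-token ceiling are hypothetical names and values chosen to match the figures above; nothing here is part of the SDK.

```js
// Classify one usageMetadata object.
// 'runaway-no-output': thoughts at or above the ceiling and zero answer tokens.
// 'runaway':           thoughts at or above the ceiling, but an answer was produced.
// 'normal':            everything else.
function classifyUsage(usage, thoughtCeiling = 60000) {
  const thoughts = usage?.thoughtsTokenCount ?? 0;
  const answer = usage?.candidatesTokenCount ?? 0;
  if (thoughts >= thoughtCeiling && answer === 0) return 'runaway-no-output';
  if (thoughts >= thoughtCeiling) return 'runaway';
  return 'normal';
}

// The URL discovery call above:
const example = {
  promptTokenCount: 42,
  candidatesTokenCount: 0,
  thoughtsTokenCount: 63853,
  totalTokenCount: 63895,
};
console.log(classifyUsage(example)); // 'runaway-no-output'
```

A guard like this only detects the spike after the tokens have been consumed, so it helps with logging and retries, not with preventing the charge.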

## Billing impact

The Gemini API billing dashboard shows SGD 435.56 consumed against a SGD 600 monthly cap - with only 4 days elapsed in June.

The job running these calls was the primary workload during those high-spend days. Given that 77% of billed tokens in the logged window were thinking tokens (2.69M of 3.49M total), and the task complexity does not justify that ratio, we believe a substantial portion of the total bill reflects the same runaway-thinking pattern in earlier runs, when we were not yet capturing `usageMetadata`.

## What I have already done

- Migrated this workload to `gemini-3.1-flash-lite` with `thinkingLevel: 'minimal'` for grounded calls - the calls now behave correctly.

- The fix confirms the issue was model-specific to 3.5 Flash: identical prompts on `gemini-3.1-flash-lite` do not exhibit the spike pattern.

## Questions for Google

1. Is `thinkingLevel: 'low'` respected when the `googleSearch` tool is active, or does the search path bypass the thinking budget?

2. Is `candidatesTokenCount: 0` with `thoughtsTokenCount: ~64k` a known failure mode? It appears the model hits an internal limit mid-reasoning and aborts without producing output - but the thinking tokens are still billed.

3. Is there a billing review path for the spike calls? The `usageMetadata` logs provide exact timestamps and token counts for every anomalous call. Please DM me and I can share the relevant details privately.

The pattern - 10% of calls at `thinkingLevel: 'low'` consuming 90% of thought-token costs, with the worst cases producing zero output - does not appear to be the intended behaviour of the `low` setting.

Thank you for your attention to this matter!