Single multimodal call (JPEG + short prompt), google.generativeai v0.8.6, zero config:
prompt_token_count = 1,107
candidates_token_count = 1,193
total_token_count = 4,598
→ hidden gap = 2,298 tokens (not exposed in SDK)
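The gap is plain arithmetic on the three fields the SDK does expose. A minimal check (the helper name and the `Usage` stand-in are ours, not SDK types):

```python
def hidden_tokens(usage):
    """Tokens counted in total_token_count but absent from prompt + candidates."""
    return (usage.total_token_count
            - usage.prompt_token_count
            - usage.candidates_token_count)

# Stand-in with the exact numbers reported above.
class Usage:
    prompt_token_count = 1107
    candidates_token_count = 1193
    total_token_count = 4598

print(hidden_tokens(Usage))  # → 2298
```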
With longer prompts, the gap grows to 5,800–15,000 tokens/call. One call produced zero visible output but 6,109 hidden tokens.
Billing check: balance before = $8.77, after one call = ~$8.82. Delta ~$0.05 for a call that should cost ~$0.001 based on visible tokens.
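Assuming the balance delta is entirely attributable to this one call, the overcharge factor follows from the figures above:

```python
visible_cost = 0.001          # expected cost estimated from visible tokens alone
observed_delta = 8.82 - 8.77  # balance before minus balance after, ~$0.05

overcharge = observed_delta / visible_cost
print(f"delta ≈ ${observed_delta:.2f}, ≈ {overcharge:.0f}× the visible-token estimate")
```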
We believe the gap is thinking tokens that the deprecated SDK does not expose (there is no thoughts_token_count attribute on usage_metadata). Setting thinking_budget=0 has no effect. max_output_tokens is a shared budget (thinking + output): with it set to 2048, only ~80 tokens of actual content come back.
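A defensive way to probe whether the installed SDK surfaces a thinking-token field at all, falling back to the residual when it does not (sketch; the field name follows newer SDK versions and the helper is our own):

```python
from types import SimpleNamespace

def thinking_tokens(usage):
    """Return the reported thinking-token count if the SDK exposes it,
    otherwise the residual gap: total - (prompt + visible output)."""
    reported = getattr(usage, "thoughts_token_count", None)
    if reported is not None:
        return reported
    return (usage.total_token_count
            - usage.prompt_token_count
            - usage.candidates_token_count)

# Old SDK shape: usage_metadata has no thoughts_token_count attribute.
old = SimpleNamespace(prompt_token_count=1107,
                      candidates_token_count=1193,
                      total_token_count=4598)
print(thinking_tokens(old))  # → 2298
```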
For our use case (OCR/transcription), thinking is not just unnecessary — it actively degrades quality. With thinking enabled, the model produces LaTeX-wrapped characters, inserts spaces between letters, and hallucinates content that isn’t in the source image. We measured +18 points of character error rate with thinking vs without. We need a way to fully disable it.
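The CER figure comes from a standard character-level edit-distance computation; a minimal version (our own sketch, not the exact evaluation harness we ran):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (two-row table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed to reach the reference, per ref char."""
    return levenshtein(reference, hypothesis) / len(reference)

# The failure mode we see with thinking enabled: spaces inserted between letters.
print(cer("energy", "e n e r g y"))  # → 0.833… (5 spurious insertions / 6 chars)
```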
gemini-2.0-flash and gemini-2.0-flash-lite both return 404 on our account, so we cannot fall back to a model without thinking.
Related: cost explosion thread
Reproduction:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
model = genai.GenerativeModel("gemini-3-flash-preview")

with open("image.jpeg", "rb") as f:
    img = f.read()

r = model.generate_content([{"mime_type": "image/jpeg", "data": img},
                            "Transcribe this."])
m = r.usage_metadata
print(m.total_token_count - m.prompt_token_count - m.candidates_token_count)
# → ~2,300 hidden tokens, billed ~$0.05 vs expected ~$0.001
```
Questions:

- Does the `total_token_count` gap represent thinking tokens?
- Is there a way to disable thinking on `gemini-3-flash-preview`?
- What rate applies to these hidden tokens?
Python 3.12, Ubuntu 24.04, JPEG 1086×1541 px.