Summary
We observe a significant and reproducible degradation in multimodal (image → text) output quality when migrating from the deprecated google.generativeai SDK (v0.8.6) to the new google.genai SDK (v1.68.0) on gemini-3-flash-preview. No combination of parameters in the new SDK reproduces the quality of the old SDK with zero configuration.
Use case
OCR/transcription of historical manuscript images (15th century Spanish legal documents). The model receives a JPEG image + a short transcription prompt and produces a diplomatic transcription.
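For reference, the new-SDK call we migrated to looks roughly like the sketch below; the file path, API-key handling and prompt text are placeholders, but the call shape is what every google.genai row in the table further down was run with:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")  # placeholder

prompt = "Produce a diplomatic transcription of this manuscript page."  # abbreviated
with open("page_001.jpg", "rb") as f:  # hypothetical file name
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

# Zero extra configuration: the closest equivalent to our old-SDK baseline.
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=[prompt, image_part],
)
print(response.text)
```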
Empirical results
The same 6 images, the same prompt, and the same model (gemini-3-flash-preview), tested systematically across configurations:
| SDK | Config | Avg CER | Tokens out | Thinking tokens | Repetition loops |
|---|---|---|---|---|---|
| `google.generativeai` | None (zero config) | 50.5% | 687–1400 | 0 | 0/6 |
| `google.genai` | `ThinkingConfig(thinking_level="MINIMAL")` + `temperature=1.0` | 69% | 700–1800 | 0 | 1/6 |
| `google.genai` | `temperature=1.0` only (no `ThinkingConfig`) | 88.8% | 1196–5004 | ~1964 | 1/6 |
| `google.genai` | `ThinkingConfig(thinking_level="MINIMAL")` + `temperature=0` | N/A | 2044 (loop) | 0 | 4/6 |
| Either SDK | `max_output_tokens=2048` (any other params) | 91% | ~80 | varies | 0/6 |
CER = Character Error Rate vs ground truth transcription. Lower is better.
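CER here is plain character-level edit distance divided by the length of the ground-truth transcription; a minimal sketch of the metric (any standard Levenshtein implementation gives the same numbers):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein distance / length of the reference."""
    # Row-by-row dynamic programming over characters.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)
```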
Issues identified
1. New SDK activates thinking by default (~1964 tokens)
Without explicit ThinkingConfig, the new SDK generates ~1964 thinking tokens internally. This causes:
- LaTeX formatting: the model wraps special characters in LaTeX notation (`$\textsf{\~{q}}$`) instead of transcribing them
- Character spacing: letters separated by spaces (`E n n d e d t h u x p o h i j o s`)
- 3.5× slower: ~43s/page vs ~12s/page
- CER degrades from 50.5% to 88.8%
The old SDK with zero config produces 0 thinking tokens and clean transcription output.
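The thinking-token counts above are read directly from the response usage metadata. A sketch of the check, assuming the `usage_metadata.thoughts_token_count` field exposed by google-genai 1.68.0 (path and prompt are placeholders):

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")  # placeholder
prompt = "Produce a diplomatic transcription of this manuscript page."  # abbreviated
with open("page_001.jpg", "rb") as f:  # hypothetical file name
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

def thinking_tokens(config: types.GenerateContentConfig | None) -> int:
    response = client.models.generate_content(
        model="gemini-3-flash-preview",
        contents=[prompt, image_part],
        config=config,
    )
    return response.usage_metadata.thoughts_token_count or 0

print(thinking_tokens(None))  # ~1964 in our runs: thinking is on by default
print(thinking_tokens(types.GenerateContentConfig(
    temperature=1.0,
    thinking_config=types.ThinkingConfig(thinking_level="MINIMAL"),
)))  # 0 in our runs, but CER is still 69% vs 50.5% on the old SDK
```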
2. thinking_level="MINIMAL" + temperature=0 triggers repetition loops
4 out of 6 images trigger an infinite repetition loop (e.g., `d̃ d̃ d̃ d̃...`) that keeps generating tokens until the 65,536-token output limit is reached. This is consistent with the known repetition loop bug documented in other forum threads.
The loops are stochastic: the same image may or may not loop depending on temperature (0.3 = OK, 0.5 = LOOP, 0.7 = OK on the same image). temperature=1.0 reduces but does not eliminate loops (1/6 pages still loops).
Cost impact is severe. A single repetition loop generates ~65,536 output tokens at $3.50/M tokens = $0.23 per looped call. A normal Vision call costs ~$0.005. That is a 46× cost overrun per affected page. In our testing session, we spent over $4.50 in a single day primarily on uncontrolled repetition loops before identifying the root cause. For a production pipeline processing thousands of pages, this is a billing risk that makes the new SDK unusable without a reliable loop prevention mechanism.
Since frequency_penalty and presence_penalty are not supported on Gemini 2.5+ and 3 models, and max_output_tokens changes the generation strategy instead of simply truncating, there is currently no server-side mechanism to prevent or limit the cost of repetition loops.
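The only mitigation we have found is client-side: stream the response and stop reading once the tail of the output looks periodic. A rough sketch, with an arbitrary probe-length/threshold heuristic; we have not confirmed whether closing the stream actually stops server-side generation (and billing) early:

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="...")  # placeholder

def looks_like_loop(text: str, probe_len: int = 20, window: int = 2000, threshold: int = 10) -> bool:
    """Heuristic: the last `probe_len` characters recur many times in the recent output."""
    if len(text) < probe_len:
        return False
    return text[-window:].count(text[-probe_len:]) >= threshold

def transcribe_with_loop_guard(contents) -> str:
    pieces = []
    stream = client.models.generate_content_stream(
        model="gemini-3-flash-preview",
        contents=contents,
        config=types.GenerateContentConfig(
            temperature=1.0,
            thinking_config=types.ThinkingConfig(thinking_level="MINIMAL"),
        ),
    )
    for chunk in stream:
        if chunk.text:
            pieces.append(chunk.text)
            if looks_like_loop("".join(pieces)):
                break  # likely a repetition loop; stop consuming the stream
    return "".join(pieces)
```

This caps what we read back from a looping call, but it is a heuristic, not a fix.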
3. max_output_tokens changes model behavior drastically
Setting max_output_tokens=2048 on either SDK causes the model to produce only ~80 tokens (3-4 lines) instead of a full 30-40 line transcription, resulting in ~91% CER. The model appears to switch to a “summary” mode when a token limit is imposed, rather than producing a truncated full transcription.
This makes max_output_tokens unusable as a cost safety net for this use case.
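For clarity, the cap that triggers this is nothing exotic; in the new SDK it is just the following (the same effect was observed with the equivalent parameter on the old SDK):

```python
from google.genai import types

# Adding only this cap flips the model into the ~80-token "summary" behaviour in our tests.
config = types.GenerateContentConfig(max_output_tokens=2048)
```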
Expected behavior
The new google.genai SDK should be able to reproduce the output quality of the deprecated google.generativeai SDK with equivalent parameters. Specifically:
- Zero-config multimodal calls should produce equivalent output
- `thinking_level="MINIMAL"` should not trigger repetition loops
- A `thinking_level=0` or `thinking_budget=0` option should be available for Gemini 3 models. The current minimum (`MINIMAL`) still generates thinking tokens that are billed and that degrade output quality on certain multimodal workloads. For use cases like OCR/transcription, where reasoning is unnecessary and counterproductive, there is no way to fully disable thinking on Gemini 3, which forces users to pay for unwanted computation that worsens results (see the sketch after this list).
- `max_output_tokens` should truncate output, not change the generation strategy
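Concretely, this is the configuration we would like to be able to send for Gemini 3. It mirrors the `thinking_budget=0` mechanism that exists for the 2.5-series models; per the behaviour described above, there is currently no accepted value that fully disables thinking on `gemini-3-flash-preview`:

```python
from google.genai import types

# Requested behaviour, not something that works on Gemini 3 today per this report:
# a zero-thinking option equivalent to thinking_budget=0 on the 2.5-series models.
desired_config = types.GenerateContentConfig(
    temperature=1.0,
    thinking_config=types.ThinkingConfig(thinking_budget=0),
)
```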
Environment
- Python 3.12
- `google-genai==1.68.0`
- `google-generativeai==0.8.6`
- Model: `gemini-3-flash-preview`
- Ubuntu 24.04 LTS
- Images: 1086×1541 px JPEG, 238 KB, historical manuscript pages
Workaround
We currently use the deprecated google.generativeai SDK with zero configuration for multimodal Vision calls. This is not sustainable long-term, as the package has been end-of-life since November 2025.
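For completeness, the workaround call: the deprecated SDK with nothing configured (path and prompt are placeholders):

```python
import google.generativeai as genai  # deprecated package, v0.8.6
import PIL.Image

genai.configure(api_key="...")  # placeholder
model = genai.GenerativeModel("gemini-3-flash-preview")

page = PIL.Image.open("page_001.jpg")  # hypothetical file name
prompt = "Produce a diplomatic transcription of this manuscript page."  # abbreviated

# Zero configuration: no thinking tokens, no loops, 50.5% average CER in our tests.
response = model.generate_content([prompt, page])
print(response.text)
```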
Pricing transparency concern
We are not the only ones affected: this forum thread reports that since March 16, 2026, Gemini 3 Flash costs have exploded ~4× overnight despite lower token usage. The suspected cause: thinking tokens are billed at the same rate as output tokens, with no way to opt out on Gemini 3 models.
In our case, the new SDK generates ~1,964 thinking tokens per Vision call that the old SDK does not. At $3.50/M output tokens, that is an extra $0.007/call — purely for internal reasoning that degrades our output quality (88.8% CER vs 50.5% without thinking).
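The arithmetic behind those per-call figures, using the $3.50/M output-token rate quoted in this post (we have not re-checked the current rate card):

```python
OUTPUT_PRICE_PER_M_TOKENS = 3.50  # USD per 1M output tokens, as quoted above

thinking_overhead = 1_964 * OUTPUT_PRICE_PER_M_TOKENS / 1_000_000   # ~$0.0069 per call
loop_incident     = 65_536 * OUTPUT_PRICE_PER_M_TOKENS / 1_000_000  # ~$0.23 per looped call

print(f"default-thinking overhead per call: ${thinking_overhead:.4f}")
print(f"one full repetition loop:           ${loop_incident:.2f}")
```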
The effective cost comparison is damning:
| Provider | Model | Cost/page (Vision OCR) | CER quality |
|---|---|---|---|
| Google (old SDK, zero config) | Gemini 3 Flash | ~$0.005 | 50.5% |
| Google (new SDK, with thinking) | Gemini 3 Flash | ~$0.012 | 69-89% |
| Google (new SDK, loop incident) | Gemini 3 Flash | $0.23 | N/A |
| Anthropic | Claude Sonnet 4.6 | ~$0.015 | Stable, no loops |
Gemini 3 Flash is advertised as a low-cost alternative, but with the new SDK’s mandatory thinking overhead and loop risk, the effective cost approaches or exceeds Claude Sonnet — for significantly worse quality on our multimodal OCR workload. The only way to achieve the advertised pricing is to use the deprecated SDK, which is end-of-life.
Reproduction
We can provide the exact images and prompt on request. The behavior is 100% reproducible across multiple runs.