Bug Report: google.genai SDK produces significantly degraded multimodal output vs deprecated google.generativeai on Gemini 3 Flash Preview


Summary

We observe a significant and reproducible degradation in multimodal (image → text) output quality when migrating from the deprecated google.generativeai SDK (v0.8.6) to the new google.genai SDK (v1.68.0) on gemini-3-flash-preview. No combination of parameters in the new SDK reproduces the quality of the old SDK with zero configuration.

Use case

OCR/transcription of historical manuscript images (15th century Spanish legal documents). The model receives a JPEG image + a short transcription prompt and produces a diplomatic transcription.

Empirical results

The same 6 images, identical prompt, identical model (gemini-3-flash-preview), tested systematically across SDKs and configurations:

| SDK | Config | Avg CER | Tokens out | Thinking tokens | Repetition loops |
|---|---|---|---|---|---|
| google.generativeai | None (zero config) | 50.5% | 687–1400 | 0 | 0/6 |
| google.genai | ThinkingConfig(thinking_level="MINIMAL") + temperature=1.0 | 69% | 700–1800 | 0 | 1/6 |
| google.genai | temperature=1.0 only (no ThinkingConfig) | 88.8% | 1196–5004 | ~1964 | 1/6 |
| google.genai | ThinkingConfig(thinking_level="MINIMAL") + temperature=0 | N/A | 2044 (loop) | 0 | 4/6 |
| Either SDK | max_output_tokens=2048 (regardless of other params) | 91% | ~80 | varies | 0/6 |

CER = Character Error Rate vs ground truth transcription. Lower is better.
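For reference, CER here is Levenshtein edit distance divided by ground-truth length. A minimal pure-Python sketch of the metric (our actual scorer may differ in Unicode normalisation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute),
    # keeping only the previous row to stay O(len(b)) in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    # Character Error Rate: edit distance normalised by reference length.
    return levenshtein(hypothesis, reference) / max(len(reference), 1)
```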

Issues identified

1. New SDK activates thinking by default (~1964 tokens)

Without explicit ThinkingConfig, the new SDK generates ~1964 thinking tokens internally. This causes:

  • LaTeX formatting: the model wraps special characters in LaTeX notation ($\textsf{\~{q}}$) instead of transcribing them

  • Character spacing: letters separated by spaces (E n n d e d t h u x p o h i j o s)

  • 3.5× slower: ~43s/page vs ~12s/page

  • CER degrades from 50.5% to 88.8%

The old SDK with zero config produces 0 thinking tokens and clean transcription output.

2. thinking_level="MINIMAL" + temperature=0 triggers repetition loops

4 out of 6 images trigger an infinite repetition loop (e.g., d̃ d̃ d̃ d̃...) that runs until the 65,536-token output limit is reached. This is consistent with the known repetition loop bug documented in other forum threads.

The loops are stochastic: the same image may or may not loop depending on temperature (0.3 = OK, 0.5 = LOOP, 0.7 = OK on the same image). temperature=1.0 reduces but does not eliminate loops (1/6 pages still loops).

Cost impact is severe. A single repetition loop generates ~65,536 output tokens at $3.50/M tokens = $0.23 per looped call. A normal Vision call costs ~$0.005. That is a 46× cost overrun per affected page. In our testing session, we spent over $4.50 in a single day primarily on uncontrolled repetition loops before identifying the root cause. For a production pipeline processing thousands of pages, this is a billing risk that makes the new SDK unusable without a reliable loop prevention mechanism.
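The arithmetic behind those numbers (the $3.50/M output-token rate is assumed from the public Gemini 3 Flash rate card; the $0.005 normal-call cost is our observed figure):

```python
PRICE_PER_M_OUTPUT = 3.50      # USD per 1M output tokens (assumed rate)
LOOP_TOKENS = 65_536           # tokens emitted before the hard output limit
NORMAL_CALL_COST = 0.005       # observed cost of a healthy Vision call

loop_cost = LOOP_TOKENS / 1_000_000 * PRICE_PER_M_OUTPUT
print(f"${loop_cost:.2f} per looped call")             # → $0.23
print(f"{loop_cost / NORMAL_CALL_COST:.0f}x overrun")  # → 46x
```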

Since frequency_penalty and presence_penalty are not supported on Gemini 2.5+ and 3 models, and max_output_tokens changes the generation strategy instead of simply truncating, there is currently no server-side mechanism to prevent or limit the cost of repetition loops.
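Absent a server-side control, the only mitigation we found is a client-side guard that watches streamed output and aborts once the tail degenerates into a tiny repeating vocabulary. A heuristic sketch — the window and thresholds are arbitrary choices, and `stream` stands in for whatever chunk iterator the SDK's streaming call returns:

```python
def looks_like_loop(text: str, window: int = 200, max_unique: int = 3) -> bool:
    """Heuristic: the tail of a looping transcription collapses to a
    handful of distinct tokens repeated over and over."""
    tail = text[-window:].split()
    return len(tail) >= 20 and len(set(tail)) <= max_unique

def collect_with_guard(stream) -> str:
    # `stream` is assumed to yield chunk objects with a `.text` attribute,
    # as the streaming iterators in both SDKs do.
    parts = []
    for chunk in stream:
        parts.append(chunk.text or "")
        if looks_like_loop("".join(parts)):
            break  # abort before the 65,536-token limit burns money
    return "".join(parts)
```

This cannot recover the lost transcription, but it caps the cost of a looped call at a few hundred tokens instead of 65,536.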

3. max_output_tokens changes model behavior drastically

Setting max_output_tokens=2048 on either SDK causes the model to produce only ~80 tokens (3-4 lines) instead of a full 30-40 line transcription, resulting in ~91% CER. The model appears to switch to a “summary” mode when a token limit is imposed, rather than producing a truncated full transcription.

This makes max_output_tokens unusable as a cost safety net for this use case.

Expected behavior

The new google.genai SDK should be able to reproduce the output quality of the deprecated google.generativeai SDK with equivalent parameters. Specifically:

  • Zero-config multimodal calls should produce equivalent output

  • thinking_level="MINIMAL" should not trigger repetition loops

  • A thinking_level=0 or thinking_budget=0 option should be available for Gemini 3 models. The current minimum (MINIMAL) still generates thinking tokens that are billed and that degrade output quality on certain multimodal workloads. For use cases like OCR/transcription where reasoning is unnecessary and counterproductive, there is no way to fully disable thinking on Gemini 3 — forcing users to pay for unwanted computation that worsens results.

  • max_output_tokens should truncate output, not change the generation strategy

Environment

  • Python 3.12

  • google-genai==1.68.0

  • google-generativeai==0.8.6

  • Model: gemini-3-flash-preview

  • Ubuntu 24.04 LTS

  • Images: 1086×1541 px JPEG, 238 KB, historical manuscript pages

Workaround

We currently use the deprecated google.generativeai SDK with zero configuration for multimodal Vision calls. This is not sustainable long-term as the package is end-of-life since November 2025.

Pricing transparency concern

We are not the only ones affected: this forum thread reports that since March 16, 2026, Gemini 3 Flash costs have exploded ~4× overnight despite lower token usage. The suspected cause: thinking tokens are billed at the same rate as output tokens, with no way to opt out on Gemini 3 models.

In our case, the new SDK generates ~1,964 thinking tokens per Vision call that the old SDK does not. At $3.50/M output tokens, that is an extra $0.007/call — purely for internal reasoning that degrades our output quality (88.8% CER vs 50.5% without thinking).
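That per-call overhead checks out arithmetically, and matches the per-page figures in the cost table (price assumed as above):

```python
THINKING_TOKENS = 1_964        # extra thinking tokens per Vision call (observed)
PRICE_PER_M_OUTPUT = 3.50      # USD per 1M output tokens (assumed rate)
BASE_CALL_COST = 0.005         # old-SDK cost per page (observed)

overhead = THINKING_TOKENS / 1_000_000 * PRICE_PER_M_OUTPUT
print(f"+${overhead:.3f}/call")                   # → +$0.007/call
print(f"~${BASE_CALL_COST + overhead:.3f}/page")  # → ~$0.012/page
```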

The effective cost comparison is damning:

| Provider | Model | Cost/page (Vision OCR) | CER / quality |
|---|---|---|---|
| Google (old SDK, zero config) | Gemini 3 Flash | ~$0.005 | 50.5% |
| Google (new SDK, with thinking) | Gemini 3 Flash | ~$0.012 | 69–89% |
| Google (new SDK, loop incident) | Gemini 3 Flash | $0.23 | N/A |
| Anthropic | Claude Sonnet 4.6 | ~$0.015 | Stable, no loops |

Gemini 3 Flash is advertised as a low-cost alternative, but with the new SDK’s mandatory thinking overhead and loop risk, the effective cost approaches or exceeds Claude Sonnet — for significantly worse quality on our multimodal OCR workload. The only way to achieve the advertised pricing is to use the deprecated SDK, which is end-of-life.

Reproduction

We can provide the exact images and prompt on request. The behavior is 100% reproducible across multiple runs.

We will continue investigating and will update this thread with our findings.
