Hello, Google AI Team.
I am developing a [Character Roleplay Chat Service] using Gemini 3.0 Pro.
First, I want to share my highly positive impression of the model’s core performance.
The creativity the model shows in chat interactions in my target language region is outstanding. In particular, the “interactions between multiple NPC characters” and the “utilization of background context” have improved significantly. Even on the “Low” reasoning setting, the narrative richness and quality far exceed what we achieved with high reasoning-token budgets on Gemini 2.5 Pro.
However, I am facing a critical structural issue regarding the “Low” reasoning setting.
I operate a dual-tier service using both “High” and “Low” settings.
- High Setting (Premium): We anticipate 2,000+ reasoning tokens, allocate a large max_output_tokens budget, and charge a premium price. This works perfectly, and users are satisfied.
- Low Setting (Standard): This is intended as our affordable baseline, replacing Gemini 2.5 Pro. We priced this tier assuming a reasoning budget of 300–500 tokens, with a slight price increase that users accept. (A configuration sketch follows this list.)
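For concreteness, this is roughly how we configure the two tiers today. A minimal sketch using the google-genai Python SDK: the model ID, the thinking_level field, and all token numbers are our assumptions from the current docs, not authoritative values.

```python
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

# Tier configs. Token budgets are illustrative of our pricing assumptions;
# thinking_level is our reading of the Gemini 3 reasoning control.
TIER_CONFIGS = {
    "premium": types.GenerateContentConfig(   # "High": 2,000+ reasoning tokens
        max_output_tokens=4096,
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
    "standard": types.GenerateContentConfig(  # "Low": priced for 300-500 reasoning tokens
        max_output_tokens=1536,
        thinking_config=types.ThinkingConfig(thinking_level="low"),
    ),
}

response = client.models.generate_content(
    model="gemini-3-pro-preview",             # assumed model ID
    contents="<roleplay turn with character and scene context>",
    config=TIER_CONFIGS["standard"],
)
print(response.text)
```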
The Fatal Problem with the “Low” Setting:
Since reasoning cannot be completely disabled in 3.0 Pro, “Low” is our only option for the standard tier. However, even with strict system prompts, the “Low” setting often behaves unpredictably, consuming 1,000+ reasoning tokens.
Because reasoning tokens and final response tokens share the same max_output_tokens limit, three problems follow:
- Cannibalization: The inflated reasoning process eats into the budget meant for the final response.
- Truncation: As a result, the user-facing response gets cut off mid-sentence.
- Economic Deadlock: We cannot simply raise max_output_tokens to prevent truncation, because that would drive costs beyond what is viable for a “Standard” price tier. (The budget arithmetic below makes this concrete.)
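To make the deadlock concrete, here is the arithmetic for the Standard tier; every number is illustrative of our pricing assumptions rather than a measured API value:

```python
# Illustrative Standard-tier budget arithmetic (all numbers are ours).
MAX_OUTPUT_TOKENS = 1536         # shared cap: reasoning + final response

planned_reasoning = 400          # 300-500 assumed when pricing the tier
observed_reasoning = 1100        # 1,000+ often seen in production on "Low"

planned_response_budget = MAX_OUTPUT_TOKENS - planned_reasoning    # 1136
observed_response_budget = MAX_OUTPUT_TOKENS - observed_reasoning  # 436

# If a typical roleplay reply needs ~600-900 tokens (illustrative figure),
# the remaining 436 tokens guarantee mid-sentence truncation. Raising the
# cap to ~2,000+ would fix truncation but break the tier's unit economics.
print(planned_response_budget, observed_response_budget)
```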
Feature Request / Immediate Fix Needed:
To make the “Low” setting commercially viable as a standard option, we need one of the following (a hypothetical request sketch follows the list):
- Hard Limit for Reasoning Tokens: Allow us to set a specific integer limit (e.g., max_reasoning_tokens: 300) separate from the total output. If the limit is hit, the model must stop reasoning and generate the response.
- Budget Separation: Distinct parameters for max_reasoning_tokens and max_response_tokens.
- “Ultra-Low” Mode: A mode strictly tuned for minimal reasoning (scratchpad level, <200 tokens) for low-latency, standard-tier applications.
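As a sketch of what we mean: the request shapes below are hypothetical, and every field marked PROPOSED does not exist in the API today; they are exactly the control we are requesting.

```python
# Hypothetical request shapes. Fields marked PROPOSED do not exist today;
# the names are the ones we are proposing, not real API parameters.
hard_limit_config = {
    "max_output_tokens": 1536,        # today's shared cap (could stay as-is)
    "thinking_config": {
        "thinking_level": "low",
        "max_reasoning_tokens": 300,  # PROPOSED: hard cap; on hitting it,
                                      # stop reasoning and emit the response
    },
}

budget_separation_config = {
    "max_reasoning_tokens": 300,      # PROPOSED: reasoning-only budget
    "max_response_tokens": 1024,      # PROPOSED: user-facing response budget
}
```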
Without this control, we cannot migrate our standard user base to Gemini 3.0 Pro.
Configuration:
- Model: Gemini 3.0 Pro
- Reasoning Setting: Low
- Current behavior: “Low” reasoning often exceeds expectations (1,000+ tokens), causing response truncation within the standard cost budget.
Thank you.