`max_output_tokens` isn't respected when using `gemini-2.5-flash` model

Heya,

I'm using the Python google.genai package to interact with Gemini and wanted to cap the number of tokens used when generating a response. However, with the model gemini-2.5-flash and max_output_tokens=1, the limit isn't taken into account.

With the model gemini-2.0-flash it does work, and the response comes back with finish_reason MAX_TOKENS.

I'm completely lost as to what might be the cause. Is someone available to help out?

This is the code I’ve used to test it:

```python
import pprint
from google import genai
from google.genai import types
from pydantic import BaseModel

class ObjectToProcess(BaseModel):
    product_id: str
    field_to_use: str
    field_text: str

class ProcessedObject(BaseModel):
    product_id: str
    field_used: str
    text_generated: str

client = genai.Client()

gemini_conf = types.GenerateContentConfig(
    thinking_config=types.ThinkingConfig(thinking_budget=0), # Disables thinking
    response_mime_type="application/json",
    response_schema=list[ProcessedObject],
    max_output_tokens=1,
)

product_object = [
    ObjectToProcess(product_id="00001", field_to_use="title", field_text="white nike t-shirt"), 
    ObjectToProcess(product_id="00002", field_to_use="title", field_text="beige nike t-shirt"), 
    ObjectToProcess(product_id="00003", field_to_use="title", field_text="black uniqlo linen shirt"),
    ObjectToProcess(product_id="00004", field_to_use="title", field_text="jeans death stranding special edition"),
    ObjectToProcess(product_id="00005", field_to_use="title", field_text="marketer red leather jacket"),
]

prompt = f"Take the following list of items {product_object} which includes the product id (which is referred to as id) and a title (referred to as title). Generate a description from the title."

response = client.models.generate_content(
    model="gemini-2.0-flash",
    config=gemini_conf,
    contents=prompt,
)

pprint.pprint(response)
print(response.candidates[0].finish_reason)
```
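
For anyone reproducing this, the quickest check is to compare finish_reason against the token counts the API reports. A minimal sketch, appended to the repro script above (the usage_metadata field names are the ones exposed by the google-genai SDK; treat them as assumptions if your version differs):

```python
# Confirm whether the cap was applied: finish_reason plus the reported
# output-token count. Continues the script above (client, types, response).
finish = response.candidates[0].finish_reason
out_tokens = response.usage_metadata.candidates_token_count or 0

print(f"finish_reason={finish}, output tokens={out_tokens}")

# With max_output_tokens=1 we expect MAX_TOKENS and at most 1 output token.
if finish != types.FinishReason.MAX_TOKENS or out_tokens > 1:
    print("max_output_tokens=1 was NOT respected")
```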

Thanks in advance!

Dediu

Perhaps related - Gemini-2.5-flash-preview-09-2025 breaks the thinking_budget parameter - #4 by Joe1

I'm also having this issue; isn't it kind of ridiculous? Couldn't someone theoretically be losing a lot of money to this?

Finish reason is ‘STOP’ and there’s no other indication as to why this is happening. Here’s another simple example:

```python
from google import genai
from google.genai import types

client = genai.Client()

model = "gemini-2.5-flash"
max_tokens = 20
thinking = 0

response = client.models.generate_content(
    model=model,
    contents=["Provide a sizeable fun fact about the Roman empire"],
    config=types.GenerateContentConfig(
        max_output_tokens=max_tokens,
        thinking_config=types.ThinkingConfig(thinking_budget=thinking),
    ),
)
print(response.text)
print("\n--- Full API Response ---")
print(response)
print("-------------------------\n")
```
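
When this misbehaves, it's worth checking where the extra tokens actually went. A hedged diagnostic sketch, continuing the example above (usage_metadata field names are as exposed by the google-genai SDK; adjust if your version differs):

```python
# Break down where tokens went: prompt vs. visible output vs. thinking.
# thoughts_token_count should be 0/None here, since thinking_budget=0.
usage = response.usage_metadata
print("prompt tokens:   ", usage.prompt_token_count)
print("output tokens:   ", usage.candidates_token_count)
print("thinking tokens: ", usage.thoughts_token_count)
print("total tokens:    ", usage.total_token_count)
print("finish_reason:   ", response.candidates[0].finish_reason)
```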

Hey folks, thanks for flagging this. Eng is actively working on a fix for this issue.


The fix is rolling out and will be live in the next couple of hours. Thanks once again for reporting this & for your patience while we resolved the issue!
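
If you want to confirm the rollout on your end, a 1-token cap should now terminate with MAX_TOKENS. A minimal sketch, reusing the same client setup as the examples above:

```python
from google import genai
from google.genai import types

client = genai.Client()

# Once the fix is live, this should stop after a single output token.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Say hello.",
    config=types.GenerateContentConfig(
        max_output_tokens=1,
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
)
assert response.candidates[0].finish_reason == types.FinishReason.MAX_TOKENS
```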


Instead of thinking_budget=0, try setting a small, explicit thinking_budget (e.g., thinking_budget=100 or thinking_budget=500). This might help the model better manage its token usage. Some users have reported better results by setting an explicit thinking_budget rather than disabling it completely, as in the sketch below.
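
A minimal sketch of that workaround (the budget value of 100 is illustrative, not an official recommendation):

```python
from google import genai
from google.genai import types

client = genai.Client()

# Workaround attempt: a small explicit thinking budget instead of
# disabling thinking entirely.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Provide a sizeable fun fact about the Roman empire",
    config=types.GenerateContentConfig(
        max_output_tokens=20,
        thinking_config=types.ThinkingConfig(thinking_budget=100),
    ),
)
print(response.candidates[0].finish_reason)
```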

> Instead of thinking_budget=0, try setting a small, explicit thinking_budget

This has not worked for me. It seems to ignore it completely.

What argument should I use in vertex.client.chat.completion to set max_tokens? max_tokens doesn't work. I am using GEMINI_2_5_FLASH_LITE.