Hi, I’m using the Gemini 2.5 Pro model with OpenAI compatibility for my project. I’ve tried setting max_completion_tokens to None, 32000, and other values, but the response always stops at around 13,000 tokens. I want to explicitly limit the output tokens — is there a way to enforce this?
Hello,
Just to confirm my understanding of your issue, you are trying to limit the maximum output tokens below a certain value using OpenAI compatibility, but you are facing difficulties achieving that. Is that correct?
Yes, that’s correct. I want to explicitly set a lower limit for the output tokens (e.g., require the completion to be around 3,000 tokens), but the model keeps returning shorter responses (~2k tokens) no matter what I set max_completion_tokens to. Is there a reliable way to enforce a stricter minimum completion length when using the OpenAI-compatible endpoint for Gemini 2.5 Pro?
Could you please share part of your code with model details and configurations, so that I can try to reproduce your issue?
Sure, here is the relevant part of my code that calls the model (for context, the use case is code generation):
```python
from __future__ import annotations  # allows the forward reference to GenerationParams

from typing import Any, Optional

from openai import OpenAI

# BaseLLMClient, settings, and SYSTEM_PROMPT are defined elsewhere in the project.


class GeminiClient(BaseLLMClient):
    """Gemini-specific implementation of the LLM client."""

    DEFAULT_MODEL = "gemini-2.5-pro"
    GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

    def __init__(self, model_name: str = DEFAULT_MODEL, max_retries: int = 2):
        """
        Initialize the Gemini client with a default model.

        Args:
            model_name: The name of the model to use
            max_retries: Maximum number of retries for API calls
        """
        super().__init__(model_name, max_retries)

    def _get_api_key(self) -> str:
        """Get the Gemini API key."""
        api_key = settings.gemini_api_key
        if not api_key:
            raise ValueError("GEMINI_API_KEY environment variable not set.")
        return api_key

    def _initialize_client(self) -> None:
        """Initialize the Gemini client."""
        api_key = self._get_api_key()
        self._client = OpenAI(
            api_key=api_key,
            base_url=self.GEMINI_BASE_URL,
        )

    def _call_api(self, params: GenerationParams) -> Any:
        """Call the Gemini API."""
        system_prompt = params.system_prompt or SYSTEM_PROMPT
        user_message = self._create_user_message(params.user_input)
        return self.client.chat.completions.create(
            model=self.model_name,
            reasoning_effort="low",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_completion_tokens=params.max_completion_tokens,
            temperature=params.temperature,
            # response_format={"type": "json_object"},
        )


class GenerationParams:
    """Parameters for text generation."""

    def __init__(
        self,
        user_input: str,
        max_completion_tokens: Optional[int] = 32000,
        system_prompt: Optional[str] = None,
        temperature: float = 0.0,
    ):
        self.user_input = user_input
        self.max_completion_tokens = max_completion_tokens
        self.system_prompt = system_prompt
        self.temperature = temperature
```
max_completion_tokens is used to set an upper limit on token generation, i.e. the maximum number of tokens that can be produced in a response. Unfortunately, there is no parameter available to define a lower bound.
So does that mean the number of tokens generated basically depends on the LLM itself? Do you have any tips to prevent it from generating short answers?
There is no minimum-token setting; how long the response ends up being depends on the model itself. However, refining your prompt can significantly improve both the length and the quality of the output. Tips:

- Give step-by-step instructions.
- Be precise about the expected content and format.
- Include one or two examples.
- Specify constraints (tone, length, schema) and provide any necessary context.
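For example, the length and format constraints can be stated directly in the system prompt you already pass to `_call_api`. A small sketch of what that could look like (the helper name and wording are just an illustration, not an official recipe, and models treat the target length as guidance rather than a guarantee):

```python
def build_system_prompt(base_prompt: str, min_words: int, fmt: str) -> str:
    """Append explicit length and format constraints to a system prompt.

    Models will not obey a hard minimum, but stating a concrete target
    length and structure usually pushes them toward longer, fuller answers.
    """
    constraints = (
        "\n\nConstraints:\n"
        f"- Write a detailed answer of at least {min_words} words.\n"
        f"- Follow this output format exactly: {fmt}\n"
        "- Do not summarize; expand each step with explanations and examples."
    )
    return base_prompt + constraints
```

You could then feed the result into GenerationParams as the system_prompt, e.g. `build_system_prompt("You are a senior Python code reviewer.", 2000, "numbered sections with code blocks")`.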
Here are some quick reads for prompt engineering:
1. Google - prompt engineering
2. Medium Blog 1
3. Medium Blog 2