Hi, I’m using the Gemini 2.5 Pro model with OpenAI compatibility for my project. I’ve tried setting max_completion_tokens to None, 32000, and other values, but the response always stops at around 13,000 tokens. I want to explicitly limit the output tokens — is there a way to enforce this?
Hello,
Just to confirm my understanding of your issue, you are trying to limit the maximum output tokens below a certain value using OpenAI compatibility, but you are facing difficulties achieving that. Is that correct?
Yes, that’s correct. I want to explicitly set a lower limit for the output tokens (e.g., require the completion to be around 3,000 tokens), but the model keeps returning shorter responses (~2k tokens) no matter what I set max_completion_tokens to. Is there a reliable way to enforce a stricter minimum completion length when using the OpenAI-compatible endpoint for Gemini 2.5 Pro?
Could you please share part of your code with model details and configurations, so that I can try to reproduce your issue?
Sure, here is the relevant part of my code that calls the model (for context, the use case is code generation):
```python
from __future__ import annotations  # allows the forward reference to GenerationParams

from typing import Any, Optional

from openai import OpenAI

# BaseLLMClient, settings, and SYSTEM_PROMPT are defined elsewhere in the project.


class GeminiClient(BaseLLMClient):
    """Gemini-specific implementation of the LLM client."""

    DEFAULT_MODEL = "gemini-2.5-pro"
    GEMINI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

    def __init__(self, model_name: str = DEFAULT_MODEL, max_retries: int = 2):
        """
        Initialize the Gemini client with a default model.

        Args:
            model_name: The name of the model to use
            max_retries: Maximum number of retries for API calls
        """
        super().__init__(model_name, max_retries)

    def _get_api_key(self) -> str:
        """Get the Gemini API key."""
        api_key = settings.gemini_api_key
        if not api_key:
            raise ValueError("GEMINI_API_KEY environment variable not set.")
        return api_key

    def _initialize_client(self) -> None:
        """Initialize the Gemini client."""
        api_key = self._get_api_key()
        self._client = OpenAI(
            api_key=api_key,
            base_url=self.GEMINI_BASE_URL,
        )

    def _call_api(self, params: GenerationParams) -> Any:
        """Call the Gemini API."""
        system_prompt = params.system_prompt or SYSTEM_PROMPT
        user_message = self._create_user_message(params.user_input)
        return self.client.chat.completions.create(
            model=self.model_name,
            reasoning_effort="low",
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_completion_tokens=params.max_completion_tokens,
            temperature=params.temperature,
            # response_format={"type": "json_object"},
        )


class GenerationParams:
    """Parameters for text generation."""

    def __init__(
        self,
        user_input: str,
        max_completion_tokens: Optional[int] = 32000,
        system_prompt: Optional[str] = None,
        temperature: float = 0.0,
    ):
        self.user_input = user_input
        self.max_completion_tokens = max_completion_tokens
        self.system_prompt = system_prompt
        self.temperature = temperature
```
max_completion_tokens is used to set an upper limit on token generation, i.e. the maximum number of tokens that can be produced in a response. Unfortunately, there is no parameter available to define a lower bound.
So does that mean the number of tokens generated basically depends on the LLM itself? Do you have any tips to prevent it from generating short answers?
There is no minimum-token setting; how long the response ends up being depends on the model itself. However, refining your prompt can significantly improve both the length and the quality of the output. Tips:

- Give step-by-step instructions.
- Be precise about the expected content and format.
- Include one or two examples.
- Specify constraints (tone, length, schema) and provide any necessary context.
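For example, the length and format constraints can be stated directly in the system prompt you already pass to `_call_api`. A small sketch of what that could look like (the helper name and wording are just an illustration, not an official recipe, and models treat the target length as guidance rather than a guarantee):

```python
def build_system_prompt(base_prompt: str, min_words: int, fmt: str) -> str:
    """Append explicit length and format constraints to a system prompt.

    Models will not obey a hard minimum, but stating a concrete target
    length and structure usually pushes them toward longer, fuller answers.
    """
    constraints = (
        "\n\nConstraints:\n"
        f"- Write a detailed answer of at least {min_words} words.\n"
        f"- Follow this output format exactly: {fmt}\n"
        "- Do not summarize; expand each step with explanations and examples."
    )
    return base_prompt + constraints
```

You could then feed the result into GenerationParams as the system_prompt, e.g. `build_system_prompt("You are a senior Python code reviewer.", 2000, "numbered sections with code blocks")`.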
Here are some quick reads for prompt engineering:
1. Google - prompt engineering
2. Medium Blog 1
3. Medium Blog 2