We’ve been experimenting with long requests and structured output on the Gemini 2.5 models via the Python SDK (the google.genai package). Even when setting max_output_tokens to its 65535 upper bound, though, we often receive truncated responses that are well below that limit:
from google import genai
from google.genai import types

# API_KEY, PROMPT, and SCHEMA are defined elsewhere; the call runs inside an async function.
config = types.GenerateContentConfig(
    http_options=types.HttpOptions(timeout=600000),
    temperature=0.9,
    max_output_tokens=65535,
    response_mime_type="application/json",
    response_schema=SCHEMA,
)
client = genai.Client(api_key=API_KEY)
contents = [
    types.Content(
        role="user",
        parts=[types.Part.from_text(text=PROMPT)],
    )
]
response = await client.aio.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=contents,
    config=config,
)
Examining responses for identical prompts, we see widely varying token counts, all well under 65535:
GenerateContentResponseUsageMetadata(cached_content_token_count=None, candidates_token_count=32907, prompt_token_count=58154, total_token_count=123675)
GenerateContentResponseUsageMetadata(cached_content_token_count=None, candidates_token_count=17224, prompt_token_count=58154, total_token_count=123676)
GenerateContentResponseUsageMetadata(cached_content_token_count=None, candidates_token_count=None, prompt_token_count=58154, total_token_count=123688)
Sometimes we even get empty responses. Interestingly, finish_reason is always MAX_TOKENS, and total_token_count - prompt_token_count always lands right around the limit.
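For completeness, here is roughly how we inspect each response (a minimal sketch; attribute names follow the google.genai response objects, and the usage_metadata fields can be None):

usage = response.usage_metadata
print("finish_reason:", response.candidates[0].finish_reason)  # always MAX_TOKENS for us
print("candidates:", usage.candidates_token_count)             # e.g. 32907
print("prompt:", usage.prompt_token_count)                     # 58154
print("total - prompt:", usage.total_token_count - usage.prompt_token_count)
# e.g. 123675 - 58154 = 65521, right at the 65535 cap, even though
# candidates_token_count is only 32907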
What explains this behavior? Are we actually hitting the output-token limit, even though candidates_token_count is well below 65535? Since we’re requesting schema-constrained JSON, a truncated response means malformed JSON. Is there anything we can do to work around this and get properly formed JSON responses back?
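For reference, this is roughly how the truncation surfaces downstream (a minimal sketch; json.loads stands in for our actual parsing, and response.text is the SDK’s convenience accessor for the text parts):

import json

try:
    data = json.loads(response.text)  # response.text is None for the empty responses
except (TypeError, json.JSONDecodeError):
    # Truncated output stops mid-document, so the schema-constrained JSON fails to parse.
    data = None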