To replicate:
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-pro-exp-03-25:generateContent?key=MY_API_KEY" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{
    "contents": [
      {
        "parts": [
          {
            "text": "could"
          }
        ]
      }
    ],
    "generationConfig": {
      "temperature": 0.1,
      "maxOutputTokens": 411
    }
  }'
Response:
{
  "usageMetadata": {
    "promptTokenCount": 1,
    "totalTokenCount": 1,
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 1
      }
    ]
  },
  "modelVersion": "gemini-2.5-pro-exp-03-25"
}
This is not the same behavior as you get from Gemini 2.0 Flash:
{
  "candidates": [
    {
      "content": {
        "parts": [
          {
            "text": "Could you please provide more context? I need to know what you want me to do with the word \"could\". For example, are you asking me to:\n\n* **Define it?** (e.g., \"Could you define 'could'?\")\n* **Use it in a sentence?"
          }
        ],
        "role": "model"
      },
      "finishReason": "MAX_TOKENS",
      "avgLogprobs": -0.13195424001724992
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 1,
    "candidatesTokenCount": 61,
    "totalTokenCount": 62,
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 1
      }
    ],
    "candidatesTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 61
      }
    ]
  },
  "modelVersion": "gemini-2.0-flash"
}
With 2.0 Flash, we get a candidate with "finishReason": "MAX_TOKENS", allowing us to determine why we got a truncated/missing response. This is easy to parse.
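As a minimal illustration (my own sketch in Python, just parsing the raw JSON bodies shown above), the kind of check that works against the 2.0 Flash response fails against the 2.5 Pro Experimental one:

import json

def finish_reason(raw_body: str) -> str:
    """Return the finishReason of the first candidate in a generateContent response."""
    response = json.loads(raw_body)
    # 2.0 Flash response above: returns "MAX_TOKENS".
    # 2.5 Pro Experimental response above: there is no "candidates" key
    # at all, so this raises KeyError.
    return response["candidates"][0]["finishReason"]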
If I increase the maxOutputTokens value a little bit more, Gemini 2.5 Pro will respond correctly, but the output is truncated after just a few tokens. I'm guessing it spends a few hundred tokens on CoT / reasoning before generating response tokens, and if the model hits the token limit during this thinking phase, it fails to generate a candidate and returns the candidate-free response shown above, which gives the caller no way to tell what went wrong.
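Until that's fixed, the only client-side workaround appears to be guessing: the sketch below (my own assumption, since the API reports nothing) treats a candidate-free response as an implicit MAX_TOKENS.

import json

def effective_finish_reason(raw_body: str) -> str:
    """Best-effort finishReason that tolerates the missing-candidates case."""
    response = json.loads(raw_body)
    candidates = response.get("candidates") or []
    if not candidates:
        # No candidate at all: with 2.5 Pro Experimental this currently seems
        # to mean the token budget was exhausted during thinking, but the
        # response gives no way to confirm that, so this is an assumption.
        return "MAX_TOKENS"
    # "UNKNOWN" is just a local placeholder, not an API value.
    return candidates[0].get("finishReason", "UNKNOWN")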
Expected behavior would be something like a candidate with an empty content object and a finishReason of MAX_TOKENS.
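Concretely, something along these lines (illustrative only; I'm assuming the shape would mirror the 2.0 Flash response above):

{
  "candidates": [
    {
      "content": {},
      "finishReason": "MAX_TOKENS"
    }
  ],
  "usageMetadata": { ... },
  "modelVersion": "gemini-2.5-pro-exp-03-25"
}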