Setting maxOutputTokens in the GenerationConfig object to a value well below the outputTokenLimit reported by list_models() should effectively truncate how much output the model generates. Instead, rather than returning the output generated so far, the model gives back empty content with finishReason set to MAX_TOKENS.
There is no other effective way to strictly limit how much output the model will generate (and yes, I am aware of prompting techniques that ask the model to be brief, to produce a short story, and so on; the key difference is that maxOutputTokens is a quantitative limit, not a qualitative one). So people try to use maxOutputTokens for that purpose and are disappointed when the model returns zero content.
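To make the setup concrete, here is a minimal sketch of the REST request body for generateContent, assuming the v1beta API shape; the prompt text and the cap of 15 tokens are illustrative, not taken from the original report:

```python
# Sketch of a generateContent request body (v1beta REST shape, assumed).
# The prompt and the cap of 15 tokens are placeholder values.
payload = {
    "contents": [
        {"parts": [{"text": "Tell me a story about a magic backpack."}]}
    ],
    "generationConfig": {
        # Deliberately far below the model's outputTokenLimit
        # as reported by list_models()
        "maxOutputTokens": 15,
    },
}
```

With a cap this low, one would expect a truncated story; what actually comes back is the empty-content response shown below.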
Confirmed that the behavior is the same with gemini-1.5-flash-latest. The model gives you no content:
{
  "candidates": [
    {
      "finishReason": "MAX_TOKENS",
      "index": 0,
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 33,
    "candidatesTokenCount": 15,
    "totalTokenCount": 48
  }
}
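Until that changes, client code has to treat a MAX_TOKENS candidate with no content field as "no usable output". A minimal sketch of such a defensive check, assuming the response has already been parsed into a dict (the function name is mine, not part of any SDK):

```python
def extract_text(response):
    """Return the generated text, or None if the model produced no usable content."""
    candidates = response.get("candidates", [])
    if not candidates:
        return None
    candidate = candidates[0]
    # When the cap is hit, the candidate may carry no "content" key at all,
    # only finishReason: "MAX_TOKENS", as in the response above.
    parts = candidate.get("content", {}).get("parts", [])
    text = "".join(part.get("text", "") for part in parts)
    return text or None
```

This at least lets the caller distinguish "empty because truncated" (check finishReason) from a normal completion, instead of crashing on a missing key.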