Proposed better handling of `MAX_TOKENS` finishReason

Setting maxOutputTokens in the GenerationConfig object to a value well below the outputTokenLimit reported by list_models() should effectively truncate how much output the model generates. Instead of returning the output generated so far, though, the API gives back empty content and finishReason MAX_TOKENS.

There is no other effective way to strictly limit how much output the model will generate (and yes, I am aware of prompting techniques asking the model to be brief, to produce a short story, and so on). The key difference is that maxOutputTokens is a quantitative specification, not a qualitative one. So people try to use maxOutputTokens for that purpose and are disappointed when the model returns zero content.
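A minimal sketch of how to reproduce this, assuming the google-generativeai Python SDK; the model name, prompt, and token cap are just what I used for testing, and a GOOGLE_API_KEY environment variable is assumed:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Deliberately tiny cap to force truncation.
response = model.generate_content(
    "Tell me a long story about a lighthouse keeper.",
    generation_config=genai.GenerationConfig(max_output_tokens=15),
)

candidate = response.candidates[0]
print(candidate.finish_reason)   # MAX_TOKENS
print(candidate.content.parts)   # empty; the partial output is dropped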

Confirmed that it’s the same behavior with gemini-1.5-flash-latest. The model gives you no content:

```json
{
  "candidates": [
    {
      "finishReason": "MAX_TOKENS",
      "index": 0,
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 33,
    "candidatesTokenCount": 15,
    "totalTokenCount": 48
  }
}
```


What is your maxOutputTokens set to?

For testing purposes, to trigger it, I set it to 15. You can see it in the usageMetadata field above (candidatesTokenCount: 15).

Try setting it higher and see if you get a response.

I did. You never get a few dozen lines of content AND finishReason MAX_TOKENS together. It is an exclusive-or relationship: either you get content, or you get MAX_TOKENS and no content.

The proposed change is to return the content generated so far, stop the generation, and of course inform the caller that the output was truncated by setting finishReason to MAX_TOKENS.
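Until something like that ships, client code has to guard against the empty-content case itself. A rough sketch of the kind of defensive accessor I mean, under the same Python SDK assumptions as above (safe_text is a hypothetical helper, not part of the SDK):

```python
def safe_text(response) -> str:
    """Return whatever text came back, or '' when truncation left nothing."""
    if not response.candidates:
        return ""
    candidate = response.candidates[0]
    if candidate.finish_reason.name == "MAX_TOKENS":
        print("warning: output was truncated at maxOutputTokens")
    # With the current behavior, MAX_TOKENS arrives with empty parts, so
    # using the response.text quick accessor directly would raise instead.
    return "".join(
        part.text for part in candidate.content.parts if part.text
    )
```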

Thank you for helping me explain the proposed change in a more accurate and hopefully understandable way.

I keep feeling like there is a bug in the API gateway that handles the response: it seems to check the finish reason (or equivalent) and, if it isn't STOP, it blocks whatever the text would have been.

Some evidence for this is that if you’re streaming, you’ll get the partial response ok, except for the last streamed chunk.
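Roughly, with the same Python SDK assumptions as the earlier sketch (prompt and cap are again arbitrary):

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-latest")

response = model.generate_content(
    "Tell me a long story about a lighthouse keeper.",
    generation_config=genai.GenerationConfig(max_output_tokens=200),
    stream=True,
)

for chunk in response:
    # Intermediate chunks carry text; the final chunk (the one that
    # carries finishReason MAX_TOKENS) comes back with empty parts.
    if chunk.candidates and chunk.candidates[0].content.parts:
        print(chunk.text, end="")
```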


You are right. I edited the title to better reflect where the problem is; my old title complained about max output tokens, which isn't the actual issue.

It wouldn't surprise me if the gateway isn't even software; it is probably a network appliance that massages the messages. That probably makes the fix itself easy while simultaneously making it hard to find the right operations contact to effect it.
