Proposed better handling of `MAX_TOKENS` finishReason

Setting maxOutputTokens in the GenerationConfig object to a value well below the outputTokenLimit reported by list_models() should effectively truncate how much output the model generates. Instead of returning the output generated so far, though, the API gives back empty content and finishReason MAX_TOKENS.

There is no other effective way to strictly limit how much output the model will generate (and yes, I am aware of prompting techniques asking the model to be brief, to produce a short story, and so on). The key difference is that maxOutputTokens is a quantitative specification, not a qualitative one. So people try to use maxOutputTokens for that purpose and are disappointed when the model returns zero content.
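A minimal sketch of how to reproduce this, assuming the google-generativeai Python SDK; the model name, prompt, and token cap are just what I used for testing, and a GOOGLE_API_KEY environment variable is assumed:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-latest")

# Deliberately tiny cap to force truncation.
response = model.generate_content(
    "Tell me a long story about a lighthouse keeper.",
    generation_config=genai.GenerationConfig(max_output_tokens=15),
)

candidate = response.candidates[0]
print(candidate.finish_reason)   # MAX_TOKENS
print(candidate.content.parts)   # empty; the partial output is dropped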

Confirmed that it’s the same behavior with gemini-1.5-flash-latest. The model gives you no content:

```json
{
  "candidates": [
    {
      "finishReason": "MAX_TOKENS",
      "index": 0,
      "safetyRatings": [
        {
          "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HATE_SPEECH",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_HARASSMENT",
          "probability": "NEGLIGIBLE"
        },
        {
          "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
          "probability": "NEGLIGIBLE"
        }
      ]
    }
  ],
  "usageMetadata": {
    "promptTokenCount": 33,
    "candidatesTokenCount": 15,
    "totalTokenCount": 48
  }
}
```


What is your maxOutputTokens set to?

For testing purposes, to trigger it, I set it to 15. You can see it in the usageMetadata field above (candidatesTokenCount: 15).

Try setting it higher and see if you get a response.

I did. You never get a few dozen lines of content AND finishReason MAX_TOKENS together. It is an exclusive-or relationship: either you get content, or you get MAX_TOKENS and no content.

The proposed change is to return the content generated so far, stop the generation, and of course inform the caller that the output was truncated by setting finishReason to MAX_TOKENS.
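Until something like that ships, client code has to guard against the empty-content case itself. A rough sketch of the kind of defensive accessor I mean, under the same Python SDK assumptions as above (safe_text is a hypothetical helper, not part of the SDK):

```python
def safe_text(response) -> str:
    """Return whatever text came back, or '' when truncation left nothing."""
    if not response.candidates:
        return ""
    candidate = response.candidates[0]
    if candidate.finish_reason.name == "MAX_TOKENS":
        print("warning: output was truncated at maxOutputTokens")
    # With the current behavior, MAX_TOKENS arrives with empty parts, so
    # using the response.text quick accessor directly would raise instead.
    return "".join(
        part.text for part in candidate.content.parts if part.text
    )
```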

Thank you for helping me explain the proposed change in a more accurate and hopefully understandable way.

I keep feeling like there is a bug in the API gateway that handles the response: it seems to check the finish reason (or equivalent) and, if it isn't STOP, it blocks whatever the text would have been.

Some evidence for this is that if you’re streaming, you’ll get the partial response ok, except for the last streamed chunk.
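Roughly, with the same Python SDK assumptions as the earlier sketch (prompt and cap are again arbitrary):

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash-latest")

response = model.generate_content(
    "Tell me a long story about a lighthouse keeper.",
    generation_config=genai.GenerationConfig(max_output_tokens=200),
    stream=True,
)

for chunk in response:
    # Intermediate chunks carry text; the final chunk (the one that
    # carries finishReason MAX_TOKENS) comes back with empty parts.
    if chunk.candidates and chunk.candidates[0].content.parts:
        print(chunk.text, end="")
```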


You are right. I edited the title to better reflect where the problem is; my old title complained about max output tokens, which isn't the actual issue.

It wouldn't surprise me if the gateway isn't even software; it is probably a network appliance that massages the messages. That probably makes the fix itself easy while simultaneously making it hard to find the right operations contact to effect it.
