Gemini 2.0 thinking model returning truncated response with a blob of whitespace

I’m experimenting with the Gemini 2.0 Flash Thinking experimental model by sending a few images along with text. However, the response is truncated and contains a large block of whitespace. I double-checked the code, which works fine with the other Gemini models, so I’m not sure whether I need to add any additional parameters. Here is the code snippet that invokes the model.

        model_name = "gemini-2.0-flash-thinking-exp-01-21"
        model = genai.GenerativeModel(model_name)
        contents = build_content_parts_for_gemini(prompt, image_parts)
        request_payload = {"contents": contents}

        if "parameters" in model_config["request_format"]:
            request_payload["generation_config"] = model_config["request_format"]["parameters"]

        try:
            response = model.generate_content(**request_payload)
            response_text = response.text if response.text else None
            
            if response_text is not None:
                return response_text, model_name
            else:
                logging.warning(f"Model {model_name} returned an empty response.")
                return None, model_name

        except Exception as e:
            logging.exception(f"Error calling model {model_name}: {e}")
            return None, model_name

Looking for any tips to get this working.

Thank you

generation_config = {
    "max_output_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 40
}

Add this to your request_payload and ensure you’re using the latest version of the Google AI SDK. The max_output_tokens parameter should help with the truncation issue.
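
For example, a minimal sketch of wiring that config into the snippet from the question, reusing the model, contents, and request_payload names defined there:

generation_config = {
    "max_output_tokens": 2048,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 40
}

request_payload = {
    "contents": contents,  # built by build_content_parts_for_gemini above
    "generation_config": generation_config
}
response = model.generate_content(**request_payload)
print(response.text)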

When I set max_output_tokens to 2048, the empty buffer at the end of the message goes away, but the response is still truncated. Is there a better way to get the full response? The same prompt and code work fine with the gemini-exp-1206 model.

I also found something interesting: the truncation behavior is affected by the prompt format. I have two prompts, one in markdown and the other in plain text. When I use the markdown prompt, the model truncates the response and also adds a lot of gibberish (probably in a different language) to it.

generation_config = {
    "max_output_tokens": 8192,  # Increased token limit
    "temperature": 0.9,
    "stop_sequences": ["\n\n"],  # Help control response ending
    "candidate_count": 1
}

This should help avoid the gibberish output and improve response quality. The gemini-2.0-flash-thinking-exp model is still experimental, so using structured plain text prompts currently yields better results.
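
If it helps with debugging, here is a rough sketch that applies this config and also logs why the model stopped, assuming the model and contents variables from the original snippet and a version of the google.generativeai SDK that exposes finish_reason on response candidates. MAX_TOKENS indicates the reply hit max_output_tokens, while STOP means the model finished on its own or matched one of the stop_sequences:

import logging

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 0.9,
    "stop_sequences": ["\n\n"],
    "candidate_count": 1
}

response = model.generate_content(contents, generation_config=generation_config)

# Log the finish reason for each candidate to see whether the truncation
# comes from the token limit, a stop sequence, or something else (e.g. safety).
for candidate in response.candidates:
    logging.info("finish_reason: %s", candidate.finish_reason)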

What is the max token limit? In this case, I’m trying to get the model to give a more predictable response, which is why I set the temperature to 0.7 originally. Also, I’m not sure what candidate_count does. Should I also remove the top_p and top_k values?
In the prompt, I ask the model to respond in markdown, but sometimes it still returns HTML.
Sorry for asking too many questions.

The typical token limit varies by model, but for most LLMs it’s between 2,048 and 4,096 tokens.
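
If you want the exact numbers for the model you’re calling, you can ask the SDK instead of guessing. A small sketch, assuming a recent google-generativeai version where genai.get_model is available:

import google.generativeai as genai

info = genai.get_model("models/gemini-2.0-flash-thinking-exp-01-21")
print(info.input_token_limit)   # maximum tokens accepted in a request
print(info.output_token_limit)  # maximum tokens the model can return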

For temperature settings:

  • Keep temperature at 0.7 for balanced creativity/consistency
  • candidate_count determines how many alternative completions to generate
  • You can remove top_p and top_k if you’re using temperature, since they’re alternative ways of controlling randomness (see the simplified config sketched below)
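
For example, a possible generation_config based on the points above (the values are illustrative, not requirements):

generation_config = {
    "max_output_tokens": 8192,  # generous limit to reduce truncation
    "temperature": 0.7,         # balanced creativity/consistency
    "candidate_count": 1        # a single completion is enough here
    # top_p and top_k omitted; temperature alone controls randomness
}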

To encourage markdown output, explicitly state the required format in your prompt (or system instruction) and validate the response format in your code. Here’s a simple example:

# Reuses the prompt, image_parts, and model variables from the snippet above.
prompt += "\n\nRespond in Markdown only. Do not use HTML tags."
response = model.generate_content(
    build_content_parts_for_gemini(prompt, image_parts),
    generation_config={"max_output_tokens": 4096, "temperature": 0.7}
)

# Basic validation: flag replies that came back as HTML instead of markdown.
if "<html" in response.text.lower() or "<div" in response.text.lower():
    logging.warning("Model returned HTML instead of markdown.")