I’m getting similarly truncated responses via the latest Node.js @google/genai package… (a rough sketch of the call shape follows the stack trace below)
Error in startStreamingChat: Error: Incomplete JSON segment at the end
    at ApiClient.processStreamResponse_1 (…/node_modules/google/genai/src/_api_client.ts:463:19)
    at processStreamResponse_1.next ()
    at resume (…/node_modules/google/genai/dist/node/index.js:2509:44)
    at fulfill (…/node_modules/google/genai/dist/node/index.js:2511:31)
    at processTicksAndRejections (node:internal/process/task_queues:105:5)
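For context, the call is roughly of this shape (a minimal sketch only — startStreamingChat is my own wrapper, and the model name and message shown here are placeholders rather than the real inputs):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function startStreamingChat() {
  // Create a chat session and stream the reply chunk by chunk.
  const chat = ai.chats.create({ model: "gemini-2.5-flash-preview-04-17" });
  try {
    const stream = await chat.sendMessageStream({ message: "long prompt here" });
    for await (const chunk of stream) {
      process.stdout.write(chunk.text ?? "");
    }
  } catch (err) {
    // This is where the "Incomplete JSON segment at the end" error surfaces.
    console.error("Error in startStreamingChat:", err);
  }
}

startStreamingChat();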
Thanks for raising this issue. I checked the Gemini documentation and the output token limit is 65,536, which is the maximum you are setting.
Could you also share if you are seeing any errors in your console?
Also, what are the types of your input fileUri and your prompt/system instruction?
Based on my understanding, the issue might be related to how the response is being handled, because I have checked on Vertex AI that the ‘gemini-2.5-flash-preview-04-17’ model is able to generate larger responses (more than 3,000 tokens). Thank you!
Because the error is masked (a generic 500 with a generic message), I don’t know what the problem is or how to reliably reproduce it. Also, this is not an open source project.
This happens when using any of the 2.5 models, or 2.0 flash with thinking.
It does not appear to happen when I switch to 2.0 pro or 2.0 flash (without thinking).
The expected return token length is probably 10k or less (not deterministic).
@junkx Following up on my last comment, I have run the code snippet you shared with the text file using the model gemini-2.5-flash-preview-04-17.
Code snippet:

import {
  GoogleGenAI,
  createUserContent,
  createPartFromUri,
} from "@google/genai";

// Empty schema used only for this repro; a real run would define the expected JSON shape.
const documentAnalysisSchema = {};
const ai = new GoogleGenAI({ apiKey: 'API-key' });

async function main() {
  // Upload the text file and reference it by URI in the request.
  const image = await ai.files.upload({
    file: "example.txt",
  });
  const fileUri = image.uri;
  const mimeType = image.mimeType;
  const prompt = "Tell me about this text file in 5000 json keys";

  const result = await ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
      { text: prompt },
      { fileData: { fileUri, mimeType } }
    ],
    config: {
      responseMimeType: 'application/json',
      responseSchema: documentAnalysisSchema,
      thinkingConfig: { includeThoughts: false },
      maxOutputTokens: 65536
    }
  });

  console.log(result.text);
  console.log(result.usageMetadata);
}

main();
I have observed that it truncates the response only if the token limit is exceeded; otherwise, it provides the expected response. You can also check the output token count using result.usageMetadata.
The model output truncated the response when the limit was exceeded (after 4,490 keys).
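For reference, both the finish reason and the token counts can be read off the same response object (a minimal sketch using the @google/genai response fields):

// Inspect why generation stopped and how many tokens were consumed.
console.log(result.candidates?.[0]?.finishReason);        // e.g. "STOP" or "MAX_TOKENS"
console.log(result.usageMetadata?.promptTokenCount);      // tokens in the input
console.log(result.usageMetadata?.candidatesTokenCount);  // tokens in the generated output
console.log(result.usageMetadata?.totalTokenCount);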
Notice the finish reason is STOP, and the tokens are less than 17K.
But the JSON response is frequently truncated (though not always). I didn’t notice this at any time when working in AI Studio.
fun callGemini(text: String?, base64Image: String?, systemInstructions: String): String {
    val request = buildGeminiRequest(text, base64Image, systemInstructions)
    val headers = HttpHeaders().apply {
        contentType = MediaType.APPLICATION_JSON
    }
    val entity = HttpEntity(request, headers)
    val apiKey = apiKeyManager.getApiKey()
    val url = "$apiUrl/$modelName:generateContent?key=$apiKey"
    try {
        val response = restTemplate.exchange(url, HttpMethod.POST, entity, Map::class.java).body
        return extractResponse(response)
    } catch (e: Exception) { //...
    }
}

private fun buildGeminiRequest(
    text: String?,
    base64Image: String?,
    systemInstructions: String,
): GeminiRequest {
    val parts = mutableListOf<Part>()
    text?.let { parts.add(Part.Text(it)) }
    base64Image?.let {
        val cleanBase64 = it.removePrefix("data:image/jpeg;base64,")
        val mimeType = "image/jpeg"
        // Add the image part to the request, using the JPEG mime type that matches the stripped prefix
        parts.add(
            Part.Image(
                InlineData(
                    mime_type = mimeType,
                    data = cleanBase64
                )
            )
        )
    }
    // Only the 2.5 models get a thinkingConfig block
    val supportsThinking = modelName.contains("gemini-2.5")
    val generationConfig = if (supportsThinking) {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens,
            thinkingConfig = ThinkingConfig(thinkingBudget)
        )
    } else {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens
        )
    }
    val request = GeminiRequest(
        contents = listOf(Content(parts = parts)),
        systemInstruction = SystemInstruction(parts = listOf(Part.Text(systemInstructions))),
        generationConfig = generationConfig,
        tools = listOf(Tool(google_search = GoogleSearch()))
    )
    return request
}
Seconding that I am experiencing this issue. I’m seeing responses with both 2.5 Flash and 2.5 Pro sometimes halt in the middle of a sentence at random. This is with context and tokens that are well below the limits (<50k input tokens, <500 output tokens), with or without structured outputs.
Are you using the SDK or the REST API endpoint? Did you check the raw response? What are you getting as the finish reason?
If you are using the REST API directly, one possible reason could be that the response is returned in multiple parts (content.parts), and these need to be concatenated to get the complete response, for example as sketched below.
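Something along these lines (a hypothetical sketch of handling a raw generateContent REST response; the response variable is assumed to already hold the parsed JSON body):

// Join every text part of the first candidate into one string.
const candidate = response.candidates?.[0];
const fullText = (candidate?.content?.parts ?? [])
  .map((part) => part.text ?? "")
  .join("");
console.log("finishReason:", candidate?.finishReason);
console.log(fullText);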
I experienced this using both the OpenAI-compatible API and the Google GenAI Python client. I don’t have logs of the finish reason, but I’ll keep an eye out for that in future testing.
Troubleshooting ought to start with the finish reason: if it’s STOP, most probably there is some issue in the application layer. For thinking models, I read somewhere that the thinking tokens are counted toward the token limit, so at times they may cause truncation. What I have noticed is that even with thinkingBudget set to a low value, the models at times go on thinking and exhaust the token limit. I have made it a best practice to print the usage and stop reason during development to better understand the behavior for my problem domain, roughly as sketched below.
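A minimal sketch of that practice with the Node.js SDK (to be run inside an async function, reusing the ai client from the snippet above; the budget value and prompt are arbitrary, and thoughtsTokenCount may be absent for non-thinking models):

const result = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-04-17",
  contents: "your prompt here",
  config: {
    maxOutputTokens: 8192,
    thinkingConfig: { thinkingBudget: 1024 }, // cap the thinking tokens
  },
});
// Log the stop reason and usage on every call during development.
console.log("finishReason:", result.candidates?.[0]?.finishReason);
console.log("thoughtsTokenCount:", result.usageMetadata?.thoughtsTokenCount);
console.log("candidatesTokenCount:", result.usageMetadata?.candidatesTokenCount);
console.log("totalTokenCount:", result.usageMetadata?.totalTokenCount);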
What’s strange is that the exact same prompt/scenario is not doing it right now. I have tried five times to transcribe a long video and it got it correct all five times, whereas a week ago it was randomly truncating 4 out of the 5 times.
The AI stopped after a code-execution turn, and it also produces a very strange dialog of exchanges, including Python snippets and their results as part of the conversation, assuming that I can see them, that a developer will present the Python code and its output, and that the end user wants this.
I suggest reducing the top_p parameter when using the model, to see if performance improves and the truncation symptoms are reduced. Try 0.5, for example with a config like the sketch below.
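A minimal sketch of lowering topP with the Node.js SDK (to be run inside an async function; the model, prompt, and temperature here are placeholders, not recommendations):

const result = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-04-17",
  contents: "your prompt here",
  config: {
    topP: 0.5,        // sample only from the top 50% of the probability mass
    temperature: 0.7, // shown only to illustrate where it sits in the config
  },
});
console.log(result.text);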
What I was hinting at with my earlier run gathering statistics from the AI is that there may always be a low probability of receiving a “stop” special token that halts the output. The default top_p is 0.95, but this only eliminates the tail, the last 5% of the probability distribution of predicted tokens. The chance of a “stop” doesn’t have to be 5%; it only requires the entire mass of tokens below it to be greater than 5% for it to always remain a possible choice in the token lottery of sampling.
The AI tells us that, with the chance of an improper “stop” token at 0.01% (0.0001) per token, the median generation length by which 50% of requests will have aborted prematurely is about 6,932 tokens. That drops to only about 693 tokens at a 0.1% chance per token.
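Those figures are easy to check: if each generated token independently has probability p of being an improper stop, the chance of surviving n tokens is (1 - p)^n, so the median abort length is ln(0.5) / ln(1 - p). A quick sketch:

// Median number of tokens before half of all requests have hit a spurious stop.
const medianLength = (p) => Math.log(0.5) / Math.log(1 - p);
console.log(medianLength(0.0001)); // ≈ 6931, matching the ~6,932 figure above
console.log(medianLength(0.001));  // ≈ 693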