Truncated Response Issue with Gemini 2.5 Flash Preview

Environment

  • Model: gemini-2.5-flash-preview-04-17
  • API: Gemini API (JavaScript/TypeScript)
  • Environment: Cloudflare Worker
  • Response Format: JSON with schema validation

Issue Description

We’re experiencing consistent truncation of responses from the Gemini model, despite:

  • Input tokens being well below limits (~3000 tokens)
  • Response size also well below limits (~3000 tokens)
  • Setting maxOutputTokens: 65536 in the config (tried without it, but same problem)
  • Using structured output with JSON schema
  • Response being truncated mid-sentence in random locations

Current Implementation

const result = await client.ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
        { text: prompt },
        { fileData: { fileUri, mimeType } }
    ],
    config: {
        responseMimeType: 'application/json',
        responseSchema: documentAnalysisSchema,
        thinkingConfig: { includeThoughts: false },
        maxOutputTokens: 65536
    }
});

Debugging Steps Taken

  1. Verified input token count:
const tokenCount = await client.ai.models.countTokens({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [/*...*/]
});
// Results show ~3000 tokens, well within limits
  2. Added console logging:
console.log('Raw response length:', response.length);
console.log('Raw response:', response);
console.log('Summary length:', parsedResponse.summary?.length);
  • The console output itself is truncated, suggesting a potential streaming/buffering issue (see the sketch after this list)
  3. Tested with different response sizes:
  • Small responses work fine
  • Medium to large responses get truncated consistently
  • Truncation occurs at different points in the text
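
To separate console truncation from actual response truncation in the Worker, one option is to return the diagnostics in the HTTP response instead of logging the full body. A minimal sketch (a fragment for inside the Worker's fetch handler, reusing prompt, fileUri, mimeType, and documentAnalysisSchema from the snippet above; the SDK response shape is assumed):

// Report lengths, finish reason, and usage instead of dumping the
// (possibly console-truncated) body to the Worker logs.
const result = await client.ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
        { text: prompt },
        { fileData: { fileUri, mimeType } }
    ],
    config: {
        responseMimeType: 'application/json',
        responseSchema: documentAnalysisSchema,
        maxOutputTokens: 65536
    }
});

return new Response(JSON.stringify({
    finishReason: result.candidates?.[0]?.finishReason, // MAX_TOKENS here would explain the cut-off
    usage: result.usageMetadata,                         // prompt/candidates/thoughts token counts
    textLength: result.text?.length ?? 0,
    text: result.text
}), { headers: { 'content-type': 'application/json' } });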

Questions

  1. Is this a known issue with the preview version of Gemini 2.5 Flash?
  2. Are there specific limitations when using Cloudflare Workers with the Gemini API?
  3. Are there recommended workarounds for handling large responses in streaming environments?
  4. Should we implement chunking on our side, and if so, what’s the recommended approach?
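
On question 4, the streaming-based chunking pattern we have in mind looks roughly like this (a sketch only, assuming the @google/genai SDK and its generateContentStream method; not a confirmed fix):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: 'API-key' });

// Stream the response and accumulate chunk text; the last chunk's
// finishReason indicates whether the model finished or was cut off.
async function streamAnalysis(prompt: string): Promise<string> {
    const stream = await ai.models.generateContentStream({
        model: 'gemini-2.5-flash-preview-04-17',
        contents: prompt,
        config: { responseMimeType: 'application/json', maxOutputTokens: 65536 }
    });

    let fullText = '';
    let finishReason: string | undefined;
    for await (const chunk of stream) {
        fullText += chunk.text ?? '';
        finishReason = chunk.candidates?.[0]?.finishReason ?? finishReason;
    }
    console.log('finishReason:', finishReason);
    return fullText;
}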

Additional Context

  • Using structured output with a schema for consistent JSON responses
  • Need to handle documents of varying sizes
  • Currently implementing document analysis with title generation, summarization, and mind mapping
  • Response includes markdown formatting which might affect token counts

Any insights or recommendations would be greatly appreciated. We’re particularly interested in:

  • Best practices for handling large responses
  • Recommended configuration for Cloudflare Workers
  • Alternative approaches to structured output handling

I’m getting similarly truncated responses via the latest Node.js @google/genai package…

Error in startStreamingChat: Error: Incomplete JSON segment at the end
at ApiClient.processStreamResponse_1 (…/node_modules/@google/genai/src/_api_client.ts:463:19)
at processStreamResponse_1.next ()
at resume (…/node_modules/@google/genai/dist/node/index.js:2509:44)
at fulfill (…/node_modules/@google/genai/dist/node/index.js:2511:31)
at processTicksAndRejections (node:internal/process/task_queues:105:5)


Hi @junkx,

Thanks for raising this issue. I checked the Gemini documentation, and the output token limit is 65,536, which is the maximum you are setting.

Could you also share if you are seeing any errors in your console?

Also, what is the type of your input fileUri, and what are your prompt/system instructions?

Based on my understanding, the issue might be related to how the response is being handled, because on Vertex AI I have checked that the ‘gemini-2.5-flash-preview-04-17’ model is able to generate larger responses (more than 3000 tokens). Thank You!!

Hi @travisbigarmor,

Thanks for sharing the error screenshot. Could you please also provide minimal reproducible code?

Because the error is masked (a generic 500 with a generic message), I don’t know what the problem is or how to reliably reproduce it. Also, this is not an open source project. [Modified by moderator]

This happens when using any of the 2.5 models, or 2.0 flash with thinking.

It does not appear to happen when I switch to 2.0 pro or 2.0 flash (without thinking).

The expected return token length is probably 10k or less (not deterministic).

@junkx Following up on my last comment, I have run the code snippet you shared with the text file using the model gemini-2.5-flash-preview-04-17.

code snippet:

import {
    GoogleGenAI,
    createUserContent,
    createPartFromUri,
} from "@google/genai";

const documentAnalysisSchema = {};
const ai = new GoogleGenAI({ apiKey: 'API-key' });

async function main() {
    const file = await ai.files.upload({
        file: "example.txt",
    });
    const fileUri = file.uri;
    const mimeType = file.mimeType;
    const prompt = "Tell me about this text file in 5000 json keys";
    const result = await ai.models.generateContent({
        model: 'gemini-2.5-flash-preview-04-17',
        contents: [
            { text: prompt },
            { fileData: { fileUri, mimeType } }
        ],
        config: {
            responseMimeType: 'application/json',
            responseSchema: documentAnalysisSchema,
            thinkingConfig: { includeThoughts: false },
            maxOutputTokens: 65536
        }
    });
    console.log(result.text);
    console.log(result.usageMetadata);
}

main();

I have observed that it truncates the response only if the token limit is exceeded; otherwise, it provides the expected response. You can also check the output token count using result.usageMetadata.

The model truncated the response when the limit was exceeded (after key 4490):

"key4489": "value4489",
"key4490": "value44

Token counts for the run that exceeded the limit:

{
  promptTokenCount: 2038,
  candidatesTokenCount: 65134,
  totalTokenCount: 68043,
  promptTokensDetails: [ { modality: 'TEXT', tokenCount: 2038 } ],
  thoughtsTokenCount: 871
}

Edit 1: Your total output token count will be candidatesTokenCount + thoughtsTokenCount.
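
A minimal sketch of that bookkeeping (assuming the response shape shown above; thinkingBudget is mentioned only as one way to cap reasoning tokens, not a guaranteed fix):

const usage = result.usageMetadata;
const outputTokens = (usage?.candidatesTokenCount ?? 0) + (usage?.thoughtsTokenCount ?? 0);
const finishReason = String(result.candidates?.[0]?.finishReason ?? '');

if (finishReason === 'MAX_TOKENS' || outputTokens >= 65536) {
    // The combined answer + thinking output hit the budget, so the JSON is truncated.
    console.warn('Output budget exhausted:', { finishReason, ...usage });
}
// Optionally cap thinking so more of the budget goes to the answer, e.g.:
// config: { maxOutputTokens: 65536, thinkingConfig: { thinkingBudget: 1024 } }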

Thank You!!

Facing the same issue with gemini-2.5-flash-preview-04-17. The following is from a run using the API:

thinkingBudget: 4048
maxToken: 24000

Total Tokens: 16738
Prompt Tokens: 5676
Output Tokens: 5134
Thoughts Token: 5928
Finish Reason: STOP

Notice the FINISH reason is STOP, and tokens are less than 17K.
But the JSON response is frequently truncated (though not always). I didn’t notice this at any time when working in AI Studio.

fun callGemini(text: String?, base64Image: String?, systemInstructions: String): String {
        val request = buildGeminiRequest(text, base64Image, systemInstructions)
        val headers = HttpHeaders().apply {
            contentType = MediaType.APPLICATION_JSON
        }

        val entity = HttpEntity(request, headers)
        val apiKey = apiKeyManager.getApiKey()
        val url = "$apiUrl/$modelName:generateContent?key=$apiKey"
        try {
            val response = restTemplate.exchange(url, HttpMethod.POST, entity, Map::class.java).body
            return extractResponse(response)
        } catch (e: Exception) { //...
        }
    }

    private fun buildGeminiRequest( text: String?,  base64Image: String?,  systemInstructions: String,
    ): GeminiRequest {
        val parts = mutableListOf<Part>()
        text?.let { parts.add(Part.Text(it)) }
        base64Image?.let {
            val cleanBase64 = it.removePrefix("data:image/jpeg;base64,")
            val mimeType = "image/jpeg"

            // Add the image part to the request, using the declared MIME type
            parts.add(
                Part.Image(
                    InlineData(
                        mime_type = mimeType,
                        data = cleanBase64
                    )
                )
            )
        }

        val supportsThinking = modelName.contains("gemini-2.5")

        val generationConfig = if (supportsThinking) {
            GenerationConfig(
                temperature = temperature,
                maxOutputTokens = maxTokens,
                thinkingConfig = ThinkingConfig(thinkingBudget)
            )
        } else {
            GenerationConfig(
                temperature = temperature,
                maxOutputTokens = maxTokens
            )
        }

        val request = GeminiRequest(
            contents = listOf(Content(parts = parts)),
            systemInstruction =  SystemInstruction(parts = listOf(Part.Text(systemInstructions))),
            generationConfig = generationConfig,
            tools =   listOf(Tool(google_search = GoogleSearch()))
        )
        return request
    }

Could there be an issue with the REST API?

I am having this issue as well. Really annoying; it cuts off mid-sentence and seems to be pretty random. Using gemini-2.5-flash-preview-05-20.

Can we get some kind of update from the Google team on this? It’s making 2.5 unusable for us, and I had to downgrade to 2.0.

Also having this issue on the May update of 2.5 Flash.

Seconding that I am experiencing this issue. Seeing responses with both 2.5 flash and 2.5 pro sometimes halt in the middle of a sentence at random. This is with context and tokens that are well below the limits (<50k input tokens, <500 output tokens) and with or without structured outputs.

Are you using the SDK or the REST API endpoint? Did you check the raw response? What are you getting as the Finish Reason?
If using the REST API directly, one possible reason could be that the response is received in multiple parts (content.parts), and these need to be concatenated to reconstruct the complete response.
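
A minimal sketch of that concatenation (TypeScript; the field names match the documented REST response shape, but treat the helper itself as illustrative):

// Join every text part of the first candidate of a raw generateContent
// REST response; reading only parts[0].text can look like truncation.
interface GeminiRestResponse {
    candidates?: Array<{
        content?: { parts?: Array<{ text?: string }> };
        finishReason?: string;
    }>;
}

function extractFullText(body: GeminiRestResponse): string {
    const parts = body.candidates?.[0]?.content?.parts ?? [];
    return parts.map((p) => p.text ?? '').join('');
}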

I’m utilizing Vercel’s AI SDK and the Gemini provider they offer.

I experienced this using both the OpenAI compatible API and the Google GenAI python clients. I don’t have logs of the finish reason, but I’ll keep an eye out for that in future testing.

I’m using @google/genai with Vertex AI. The issue is not present on 2.0.

The troubleshooting ought to start with the finish reason: if it’s STOP, most probably there is some issue in the application layer. For thinking models, I read somewhere that the thinking tokens are counted toward the context length, so at times that may cause truncation. What I have noticed is that even with thinkingBudget set to a low value, the models at times go on thinking and exhaust the token limit. I have made it a best practice to print usage and stop reason during development to better understand behavior for my problem domain.
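
As a sketch of that practice with the @google/genai JS SDK (field names as in the posts above; the helper itself is illustrative):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: 'API-key' });

// Log finish reason and token usage alongside every call during development.
async function generateWithDiagnostics(prompt: string) {
    const response = await ai.models.generateContent({
        model: 'gemini-2.5-flash-preview-04-17',
        contents: prompt,
        config: { maxOutputTokens: 8192, thinkingConfig: { thinkingBudget: 1024 } }
    });
    console.log('finishReason:', response.candidates?.[0]?.finishReason); // STOP vs MAX_TOKENS
    console.log('usage:', response.usageMetadata); // prompt/candidates/thoughts token counts
    return response.text;
}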

Will check stop reason to find out.

What’s strange is that the exact same prompt/scenario is not doing it right now. I have tried 5 times to transcribe a long video and it got it correct all 5 times, whereas a week ago it was randomly truncating 4 out of the 5 times.

OK, it randomly started again. The finish reason is “stop”, but the output tokens are very low:

"candidatesTokensDetails": [
      {
         "modality": "TEXT",
         "tokenCount": 724
       }
     ],
     "thoughtsTokenCount": 974

I checked the candidates to see if there are other parts I need to stitch together, and there are not. This appears to be a Gemini bug.

The irony is palpable…

The AI stopped after a code-run turn, and also produced a very strange dialog of interchanges including Python snippets and results as part of the conversation - assuming that I can see them, that a developer will present the Python code and its output, and that the end user wants this.

I suggest reducing the top_p parameter when using the model, to see if performance improves and the truncation symptoms are reduced. Try 0.5.
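
Lowering top_p is just a config change; a sketch with the JS SDK (the 0.5 value is the suggestion above, not a documented fix):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: 'API-key' });
const prompt = 'your prompt here';

const result = await ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: prompt,
    config: {
        topP: 0.5,              // reduced sampling nucleus, per the suggestion above
        maxOutputTokens: 65536
    }
});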

What I was hinting at, with the task run above that produced those AI-generated statistics, is that there may always be a low probability of receiving a “stop” special token that halts the output. The default top_p is 0.95 - but this only eliminates the tail of the last 5% of the probability distribution of predicted tokens. The chance of a “stop” doesn’t have to be 5% - it only requires the entire mass of tokens below it to be greater than 5% for it to always be a possible choice in the token lottery of sampling.

The AI tells us that with a 0.01% (0.0001) chance of an improper “stop” token per generated token, the median generation length by which 50% of requests will have aborted prematurely is about 6,932 tokens. That drops to only about 693 tokens at a 0.1% chance per token.
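
The arithmetic behind those figures: if each generated token independently has probability p of being an improper stop, the median length before a premature halt solves (1 - p)^n = 0.5:

// Median number of tokens generated before a premature stop,
// given an independent per-token stop probability p:
//   (1 - p)^n = 0.5  =>  n = ln(0.5) / ln(1 - p)
const medianLength = (p: number): number => Math.log(0.5) / Math.log(1 - p);

console.log(Math.round(medianLength(0.0001))); // ≈ 6931 tokens (the ~6,932 figure above)
console.log(Math.round(medianLength(0.001)));  // ≈ 693 tokens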