Truncated Response Issue with Gemini 2.5 Flash Preview

Environment

  • Model: gemini-2.5-flash-preview-04-17
  • API: Gemini API (JavaScript/TypeScript)
  • Environment: Cloudflare Worker
  • Response Format: JSON with schema validation

Issue Description

We’re experiencing consistent truncation of responses from the Gemini model. The output is cut off mid-sentence at random locations, despite:

  • Input tokens being well below limits (~3000 tokens)
  • Response tokens being well below limits (~3000 tokens)
  • Setting maxOutputTokens: 65536 in the config (we also tried without it; same problem)
  • Using structured output with a JSON schema

Current Implementation

const result = await client.ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
        { text: prompt },
        { fileData: { fileUri, mimeType } }  // previously uploaded document
    ],
    config: {
        responseMimeType: 'application/json',        // structured JSON output
        responseSchema: documentAnalysisSchema,
        thinkingConfig: { includeThoughts: false },
        maxOutputTokens: 65536                       // documented model maximum
    }
});

Debugging Steps Taken

  1. Verified input token count:
const tokenCount = await client.ai.models.countTokens({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [/*...*/]
});
// Results show ~3000 tokens, well within limits
  2. Added console logging:
console.log('Raw response length:', response.length);
console.log('Raw response:', response);
console.log('Summary length:', parsedResponse.summary?.length);
  • The console output itself is truncated, suggesting a potential streaming/buffering issue
  3. Tested with different response sizes:
  • Small responses work fine
  • Medium to large responses get truncated consistently
  • Truncation occurs at different points in the text
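
One more thing worth logging is the finish reason and usage metadata on the response itself, which should show whether the model stopped on its own or ran into a limit. A minimal sketch, assuming the standard @google/genai response shape (result is the object returned by generateContent above):

// Sketch: inspect why generation stopped and how many tokens were produced
const finishReason = result.candidates?.[0]?.finishReason;
console.log('Finish reason:', finishReason);  // 'STOP' vs 'MAX_TOKENS'
console.log('Usage:', result.usageMetadata);  // prompt/candidates/thoughts token counts
if (finishReason === 'MAX_TOKENS') {
    console.warn('Output hit maxOutputTokens; truncation is expected in this case');
}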

Questions

  1. Is this a known issue with the preview version of Gemini 2.5 Flash?
  2. Are there specific limitations when using Cloudflare Workers with the Gemini API?
  3. Are there recommended workarounds for handling large responses in streaming environments?
  4. Should we implement chunking on our side, and if so, what’s the recommended approach? (A rough sketch of what we’re considering is below.)
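
For question 4, the streaming variant we have in mind looks roughly like this; a sketch using generateContentStream from @google/genai with the same request shape as above (not yet tested in our Worker):

// Sketch: consume the response incrementally and parse only once complete
const stream = await client.ai.models.generateContentStream({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
        { text: prompt },
        { fileData: { fileUri, mimeType } }
    ],
    config: {
        responseMimeType: 'application/json',
        responseSchema: documentAnalysisSchema,
        maxOutputTokens: 65536
    }
});

let fullText = '';
for await (const chunk of stream) {
    fullText += chunk.text ?? '';  // accumulate the raw JSON text
}
const parsed = JSON.parse(fullText);  // truncated JSON would throw here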

Additional Context

  • Using structured output with a schema for consistent JSON responses
  • Need to handle documents of varying sizes
  • Currently implementing document analysis with title generation, summarization, and mind mapping
  • Response includes markdown formatting which might affect token counts

Any insights or recommendations would be greatly appreciated. We’re particularly interested in:

  • Best practices for handling large responses
  • Recommended configuration for Cloudflare Workers
  • Alternative approaches to structured output handling

I’m getting similarly truncated responses via the latest Node.js @google/genai package…

Error in startStreamingChat: Error: Incomplete JSON segment at the end
at ApiClient.processStreamResponse_1 (…/node_modules/@google/genai/src/_api_client.ts:463:19)
at processStreamResponse_1.next (<anonymous>)
at resume (…/node_modules/@google/genai/dist/node/index.js:2509:44)
at fulfill (…/node_modules/@google/genai/dist/node/index.js:2511:31)
at processTicksAndRejections (node:internal/process/task_queues:105:5)
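
For reference, the call that triggers this is shaped roughly as follows; a simplified sketch of our startStreamingChat (identifiers and the message are placeholders):

import { GoogleGenAI } from '@google/genai';

// Simplified sketch of the streaming chat call behind startStreamingChat
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const chat = ai.chats.create({ model: 'gemini-2.5-flash-preview-04-17' });

const userMessage = 'Summarize this document';  // placeholder
const stream = await chat.sendMessageStream({ message: userMessage });
for await (const chunk of stream) {
    process.stdout.write(chunk.text ?? '');
}
// The "Incomplete JSON segment at the end" error is thrown from inside
// this async iteration, not from our own parsing code.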

Hi @junkx,

Thanks for raising this issue. I checked the Gemini documentation and the output token limit is 65,536, which is the maximum you are already setting.

Could you also share if you are seeing any errors in your console?

Also, what are the types of your input fileUri and your prompt/system instruction?

Based on my understanding, the issue might be related to how the response is being handled: on Vertex AI, I have verified that the ‘gemini-2.5-flash-preview-04-17’ model is able to generate larger responses (more than 3000 tokens). Thank you!

Hi @travisbigarmor,

Thanks for sharing the error screenshot. Could you please also provide a minimal reproducible code sample?

Because the error is masked (a generic 500 with a generic message), I don’t know what the problem is or how to reliably reproduce it. Also, this is not an open source project.

This happens when using any of the 2.5 models, or 2.0 flash with thinking.

It does not appear to happen when I switch to 2.0 pro or 2.0 flash (without thinking).

The expected return token length is probably 10k or less (not deterministic).

@junkx Following up on my last comment, I have run the code snippet you shared with the text file using the model gemini-2.5-flash-preview-04-17.

code snippet:

import { GoogleGenAI } from "@google/genai";

const documentAnalysisSchema = {};
const ai = new GoogleGenAI({ apiKey: 'API-key' });

async function main() {
  // Upload the test file and reference it by URI in the request
  const uploadedFile = await ai.files.upload({
    file: "example.txt",
  });
  const fileUri = uploadedFile.uri;
  const mimeType = uploadedFile.mimeType;
  const prompt = "Tell me about this text file in 5000 json keys";
  const result = await ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
      { text: prompt },
      { fileData: { fileUri, mimeType } }
    ],
    config: {
      responseMimeType: 'application/json',
      responseSchema: documentAnalysisSchema,
      thinkingConfig: { includeThoughts: false },
      maxOutputTokens: 65536
    }
  });
  console.log(result.text);
  console.log(result.usageMetadata);
}

main();

I have observed that it truncates the response only if the token limit is exceeded; otherwise, it provides the expected response. You can also check the output token count using result.usageMetadata.

The model output was truncated when the limit was exceeded (after key 4490):

"key4489": "value4489",
"key4490": "value44

Token counts for the run that exceeded the limit:

{
  promptTokenCount: 2038,
  candidatesTokenCount: 65134,
  totalTokenCount: 68043,
  promptTokensDetails: [ { modality: 'TEXT', tokenCount: 2038 } ],
  thoughtsTokenCount: 871
}

Edit 1: Your total output token count will be candidatesTokenCount + thoughtsTokenCount.
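
Based on this, a simple guard on the client side can flag limit-induced truncation; a sketch using the fields above:

// Sketch: detect truncation caused by the output-token budget
const usage = result.usageMetadata;
const outputTokens =
    (usage?.candidatesTokenCount ?? 0) + (usage?.thoughtsTokenCount ?? 0);
const finishReason = result.candidates?.[0]?.finishReason;

if (finishReason === 'MAX_TOKENS' || outputTokens >= 65536) {
    // Ran out of budget: request less output, raise maxOutputTokens,
    // or reduce the thinking budget
    console.warn(`Truncated at ${outputTokens} output tokens (finishReason=${finishReason})`);
}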

Thank you!

Facing the same issue with gemini-2.5-flash-preview-04-17. The following is from a run using the API:

thinkingBudget: 4048
maxToken: 24000

Total Tokens: 16738
Prompt Tokens: 5676
Output Tokens: 5134
Thoughts Tokens: 5928
Finish Reason: STOP

Notice the finish reason is STOP, and the total token count is under 17K.
But the JSON response is frequently truncated (though not always). I didn’t notice this at any point when working in AI Studio.
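
One stopgap I’m considering is to detect the truncation by parsing the JSON and retrying on failure; a sketch of the logic in TypeScript to match the snippets above (my actual service is Kotlin):

// Sketch: retry the call when the returned JSON does not parse (i.e. was cut off)
async function callWithRetry(call: () => Promise<string>, maxAttempts = 3): Promise<unknown> {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const raw = await call();
        try {
            return JSON.parse(raw);  // complete JSON parses; truncated JSON throws
        } catch {
            console.warn(`Attempt ${attempt}: truncated or invalid JSON, retrying`);
        }
    }
    throw new Error('Response still truncated after retries');
}

My actual call path is below: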

fun callGemini(text: String?, base64Image: String?, systemInstructions: String): String {
    val request = buildGeminiRequest(text, base64Image, systemInstructions)
    val headers = HttpHeaders().apply {
        contentType = MediaType.APPLICATION_JSON
    }

    val entity = HttpEntity(request, headers)
    val apiKey = apiKeyManager.getApiKey()
    val url = "$apiUrl/$modelName:generateContent?key=$apiKey"
    try {
        val response = restTemplate.exchange(url, HttpMethod.POST, entity, Map::class.java).body
        return extractResponse(response)
    } catch (e: Exception) { //...
    }
}

private fun buildGeminiRequest(
    text: String?,
    base64Image: String?,
    systemInstructions: String,
): GeminiRequest {
    val parts = mutableListOf<Part>()
    text?.let { parts.add(Part.Text(it)) }
    base64Image?.let {
        val cleanBase64 = it.removePrefix("data:image/jpeg;base64,")
        val mimeType = "image/jpeg"

        // Add the image part to the request; the declared MIME type
        // must match the actual (JPEG) data
        parts.add(
            Part.Image(
                InlineData(
                    mime_type = mimeType,
                    data = cleanBase64
                )
            )
        )
    }

    val supportsThinking = modelName.contains("gemini-2.5")

    val generationConfig = if (supportsThinking) {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens,
            thinkingConfig = ThinkingConfig(thinkingBudget)
        )
    } else {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens
        )
    }

    return GeminiRequest(
        contents = listOf(Content(parts = parts)),
        systemInstruction = SystemInstruction(parts = listOf(Part.Text(systemInstructions))),
        generationConfig = generationConfig,
        tools = listOf(Tool(google_search = GoogleSearch()))
    )
}

Could there be an issue with the REST API?