I’m getting similarly truncated responses via the latest Node.js @google/genai package… (a rough sketch of the call shape follows the stack trace below)
Error in startStreamingChat: Error: Incomplete JSON segment at the end
    at ApiClient.processStreamResponse_1 (…/node_modules/google/genai/src/_api_client.ts:463:19)
    at processStreamResponse_1.next ()
    at resume (…/node_modules/google/genai/dist/node/index.js:2509:44)
    at fulfill (…/node_modules/google/genai/dist/node/index.js:2511:31)
    at processTicksAndRejections (node:internal/process/task_queues:105:5)
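For context, the call is roughly of this shape (a minimal sketch only — startStreamingChat is my own wrapper, and the model name and message shown here are placeholders rather than the real inputs):

import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function startStreamingChat() {
  // Create a chat session and stream the reply chunk by chunk.
  const chat = ai.chats.create({ model: "gemini-2.5-flash-preview-04-17" });
  try {
    const stream = await chat.sendMessageStream({ message: "long prompt here" });
    for await (const chunk of stream) {
      process.stdout.write(chunk.text ?? "");
    }
  } catch (err) {
    // This is where the "Incomplete JSON segment at the end" error surfaces.
    console.error("Error in startStreamingChat:", err);
  }
}

startStreamingChat();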
Thanks for raising this issue. I checked the Gemini documentation and the output token limit is 65,536, which is the maximum you are setting.
Could you also share if you are seeing any errors in your console?
Also, what are the types of your input fileUri and your prompt/system instruction?
Based on my understanding, the issue might be related to how the response is being handled, because I have checked on Vertex AI that the ‘gemini-2.5-flash-preview-04-17’ model is able to generate larger responses (more than 3,000 tokens). Thank you!
Because the error is masked (a generic 500 with a generic message), I don’t know what the problem is or how to reliably reproduce it. Also, this is not an open source project.
This happens when using any of the 2.5 models, or 2.0 flash with thinking.
It does not appear to happen when I switch to 2.0 pro or 2.0 flash (without thinking).
The expected return token length is probably 10k or less (not deterministic).
@junkx Following up on my last comment, I have run the code snippet you shared with the text file using the model gemini-2.5-flash-preview-04-17.
Code snippet:

import {
  GoogleGenAI,
  createUserContent,
  createPartFromUri,
} from "@google/genai";

// Empty schema used only for this repro; a real run would define the expected JSON shape.
const documentAnalysisSchema = {};
const ai = new GoogleGenAI({ apiKey: 'API-key' });

async function main() {
  // Upload the text file and reference it by URI in the request.
  const image = await ai.files.upload({
    file: "example.txt",
  });
  const fileUri = image.uri;
  const mimeType = image.mimeType;
  const prompt = "Tell me about this text file in 5000 json keys";

  const result = await ai.models.generateContent({
    model: 'gemini-2.5-flash-preview-04-17',
    contents: [
      { text: prompt },
      { fileData: { fileUri, mimeType } }
    ],
    config: {
      responseMimeType: 'application/json',
      responseSchema: documentAnalysisSchema,
      thinkingConfig: { includeThoughts: false },
      maxOutputTokens: 65536
    }
  });

  console.log(result.text);
  console.log(result.usageMetadata);
}

main();
I have observed that it truncates the response only if the token limit is exceeded; otherwise, it provides the expected response. You can also check the output token count using result.usageMetadata.
The model output truncated the response when the limit was exceeded (after 4,490 keys).
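For reference, both the finish reason and the token counts can be read off the same response object (a minimal sketch using the @google/genai response fields):

// Inspect why generation stopped and how many tokens were consumed.
console.log(result.candidates?.[0]?.finishReason);        // e.g. "STOP" or "MAX_TOKENS"
console.log(result.usageMetadata?.promptTokenCount);      // tokens in the input
console.log(result.usageMetadata?.candidatesTokenCount);  // tokens in the generated output
console.log(result.usageMetadata?.totalTokenCount);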
Notice the finish reason is STOP, and the tokens are less than 17K.
But the JSON response is frequently truncated (though not always). I didn’t notice this at any time when working in AI Studio.
fun callGemini(text: String?, base64Image: String?, systemInstructions: String): String {
    val request = buildGeminiRequest(text, base64Image, systemInstructions)
    val headers = HttpHeaders().apply {
        contentType = MediaType.APPLICATION_JSON
    }
    val entity = HttpEntity(request, headers)
    val apiKey = apiKeyManager.getApiKey()
    val url = "$apiUrl/$modelName:generateContent?key=$apiKey"
    try {
        val response = restTemplate.exchange(url, HttpMethod.POST, entity, Map::class.java).body
        return extractResponse(response)
    } catch (e: Exception) { //...
    }
}

private fun buildGeminiRequest(
    text: String?,
    base64Image: String?,
    systemInstructions: String,
): GeminiRequest {
    val parts = mutableListOf<Part>()
    text?.let { parts.add(Part.Text(it)) }
    base64Image?.let {
        val cleanBase64 = it.removePrefix("data:image/jpeg;base64,")
        val mimeType = "image/jpeg"
        // Add the image part to the request, using the JPEG mime type that matches the stripped prefix
        parts.add(
            Part.Image(
                InlineData(
                    mime_type = mimeType,
                    data = cleanBase64
                )
            )
        )
    }
    // Only the 2.5 models get a thinkingConfig block
    val supportsThinking = modelName.contains("gemini-2.5")
    val generationConfig = if (supportsThinking) {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens,
            thinkingConfig = ThinkingConfig(thinkingBudget)
        )
    } else {
        GenerationConfig(
            temperature = temperature,
            maxOutputTokens = maxTokens
        )
    }
    val request = GeminiRequest(
        contents = listOf(Content(parts = parts)),
        systemInstruction = SystemInstruction(parts = listOf(Part.Text(systemInstructions))),
        generationConfig = generationConfig,
        tools = listOf(Tool(google_search = GoogleSearch()))
    )
    return request
}
Seconding that I am experiencing this issue. I’m seeing responses with both 2.5 Flash and 2.5 Pro sometimes halt in the middle of a sentence at random. This is with context and tokens that are well below the limits (<50k input tokens, <500 output tokens), with or without structured outputs.
Are you using the SDK or the REST API endpoint? Did you check the raw response? What are you getting as the finish reason?
If you are using the REST API directly, one possible reason could be that the response is returned in multiple parts (content.parts), and these need to be concatenated to get the complete response, for example as sketched below.
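Something along these lines (a hypothetical sketch of handling a raw generateContent REST response; the response variable is assumed to already hold the parsed JSON body):

// Join every text part of the first candidate into one string.
const candidate = response.candidates?.[0];
const fullText = (candidate?.content?.parts ?? [])
  .map((part) => part.text ?? "")
  .join("");
console.log("finishReason:", candidate?.finishReason);
console.log(fullText);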
I experienced this using both the OpenAI-compatible API and the Google GenAI Python client. I don’t have logs of the finish reason, but I’ll keep an eye out for that in future testing.
Troubleshooting ought to start with the finish reason: if it’s STOP, most probably there is some issue in the application layer. For thinking models, I read somewhere that the thinking tokens are counted toward the token limit, so at times they may cause truncation. What I have noticed is that even with thinkingBudget set to a low value, the models at times go on thinking and exhaust the token limit. I have made it a best practice to print the usage and stop reason during development to better understand the behavior for my problem domain, roughly as sketched below.
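A minimal sketch of that practice with the Node.js SDK (to be run inside an async function, reusing the ai client from the snippet above; the budget value and prompt are arbitrary, and thoughtsTokenCount may be absent for non-thinking models):

const result = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-04-17",
  contents: "your prompt here",
  config: {
    maxOutputTokens: 8192,
    thinkingConfig: { thinkingBudget: 1024 }, // cap the thinking tokens
  },
});
// Log the stop reason and usage on every call during development.
console.log("finishReason:", result.candidates?.[0]?.finishReason);
console.log("thoughtsTokenCount:", result.usageMetadata?.thoughtsTokenCount);
console.log("candidatesTokenCount:", result.usageMetadata?.candidatesTokenCount);
console.log("totalTokenCount:", result.usageMetadata?.totalTokenCount);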
What’s strange is that the exact same prompt/scenario is not doing it right now. I have tried five times to transcribe a long video and it got it correct all five times, whereas a week ago it was randomly truncating 4 out of the 5 times.
The AI stopped after a code-execution turn, and it also produces a very strange dialog of exchanges, including Python snippets and their results as part of the conversation, assuming that I can see them, that a developer will present the Python code and its output, and that the end user wants this.
I suggest reducing the top_p parameter when using the model, to see if performance improves and the truncation symptoms are reduced. Try 0.5, for example with a config like the sketch below.
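A minimal sketch of lowering topP with the Node.js SDK (to be run inside an async function; the model, prompt, and temperature here are placeholders, not recommendations):

const result = await ai.models.generateContent({
  model: "gemini-2.5-flash-preview-04-17",
  contents: "your prompt here",
  config: {
    topP: 0.5,        // sample only from the top 50% of the probability mass
    temperature: 0.7, // shown only to illustrate where it sits in the config
  },
});
console.log(result.text);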
What I was hinting at with my earlier run gathering statistics from the AI is that there may always be a low probability of receiving a “stop” special token that halts the output. The default top_p is 0.95, but this only eliminates the tail, the last 5% of the probability distribution of predicted tokens. The chance of a “stop” doesn’t have to be 5%; it only requires the entire mass of tokens below it to be greater than 5% for it to always remain a possible choice in the token lottery of sampling.
The AI tells us that, with the chance of an improper “stop” token at 0.01% (0.0001) per token, the median generation length by which 50% of requests will have aborted prematurely is about 6,932 tokens. That drops to only about 693 tokens at a 0.1% chance per token.
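Those figures are easy to check: if each generated token independently has probability p of being an improper stop, the chance of surviving n tokens is (1 - p)^n, so the median abort length is ln(0.5) / ln(1 - p). A quick sketch:

// Median number of tokens before half of all requests have hit a spurious stop.
const medianLength = (p) => Math.log(0.5) / Math.log(1 - p);
console.log(medianLength(0.0001)); // ≈ 6931, matching the ~6,932 figure above
console.log(medianLength(0.001));  // ≈ 693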