Gemini 3 output limited to ~4k tokens instead of 65k

User-Written Introduction of Issue

In OpenCode and Gemini CLI, using OAuth or an API key, I cannot get Gemini 3 Flash Thinking to write an .md file that is longer than ~3000 tokens in one operation. This prevents Gemini from being able to comprehensively plan and write documents, such as a project SPEC. This is something Claude models do quite well, despite having a much smaller context window.

The Gemini models are represented as having a ~65k output limit, but in practice, I cannot get even a small fraction of that written to a file, which has me wondering why.

After several hours of trying to figure this out myself, and talking with Gemini and ChatGPT, I’m at a loss. I’ve tried many changes to settings (maxOutputTokens, thinking on/off, etc.) with no meaningful change. It feels like there is some cap on output that prevents Gemini from producing useful documents and responses. This severely degrades Gemini’s performance on high-complexity tasks, and limits its utility on low-complexity tasks that models like Sonnet 4.5, despite its less robust reasoning capabilities, excel at (e.g. data extraction and consolidation).

Currently, Gemini has zero capacity for orchestration in my workflow due to its severely constrained (or broken?) output limits. It cannot write the documentation most of my projects require to progress.

GPT 5.2 Written Distillation of Testing Data (User Reviewed)

What I tested (REST API evidence, in addition to OpenCode + Gemini CLI)

Environment:

  • Windows 10
  • Windows Terminal, PowerShell
  • Gemini API via AI Studio API key (also tested with a paid-tier key)
  • Endpoint: https://generativelanguage.googleapis.com/v1beta/...

1) models.get reports 65,536 output tokens (so the model metadata agrees with the published limit)

When I run models.get for models/gemini-3-flash-preview, the API reports:

  • inputTokenLimit: 1048576
  • outputTokenLimit: 65536
  • temperature: 1
  • topP: 0.95
  • topK: 64
  • thinking: true

So as far as get_model / models.get is concerned, the output limit is exactly what the docs imply: 65,536 output tokens.
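For reproduction, the check I ran was essentially the following (a minimal sketch assuming an AI Studio key in a GEMINI_API_KEY environment variable; I actually ran it from PowerShell, but the request is the same):

import os, requests

key = os.environ["GEMINI_API_KEY"]
url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-3-flash-preview"

resp = requests.get(url, params={"key": key})
resp.raise_for_status()
info = resp.json()

# The model metadata itself advertises the 65,536-token output ceiling
print(info["inputTokenLimit"], info["outputTokenLimit"])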

2) But generateContent stops around ~3k output tokens with finishReason: STOP (not MAX_TOKENS), even with huge maxOutputTokens

I repeatedly asked Flash 3 to output very long markdown (ex: “Output a Markdown document that is 20,000 tokens long…”) with:

  • maxOutputTokens = 60000
  • thinking enabled (thinkingLevel HIGH)
  • also tested removing thinking config entirely

What I get back over and over is:

  • finishReason: STOP
  • candidatesTokenCount: 2952 (typical)
  • thoughtsTokenCount: ~700–800 (varies)
  • Output length: ~2,000 words / ~13k chars range (very roughly)

I also ran a variant prompt (“as long as possible”) and got essentially the same behavior:

  • finishReason: STOP
  • candidatesTokenCount: 2982

So the model is not being cut off by a configured ceiling (maxOutputTokens) and it is not reporting a max-token termination. It just stops early on its own.
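For anyone trying to reproduce this, the request was roughly the following (a sketch only; the thinkingConfig field names reflect how I set thinkingLevel HIGH and may need adjusting for your SDK or API version):

import os, requests

key = os.environ["GEMINI_API_KEY"]
url = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-3-flash-preview:generateContent")

body = {
    "contents": [{"role": "user", "parts": [{"text":
        "Output a Markdown document that is 20,000 tokens long about the project architecture."}]}],
    "generationConfig": {
        "maxOutputTokens": 60000,
        # thinking settings as used in my tests; exact field placement is my assumption
        "thinkingConfig": {"thinkingLevel": "HIGH"},
    },
}

resp = requests.post(url, params={"key": key}, json=body)
resp.raise_for_status()
data = resp.json()

print(data["candidates"][0].get("finishReason"))  # comes back "STOP", not "MAX_TOKENS"
print(data.get("usageMetadata", {}))              # candidatesTokenCount lands around ~2,950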

3) Same behavior with paid tier key (so it does not look like only a free-tier restriction)

I repeated Flash 3 tests using an API key tied to a paid tier, and the results were the same: ~2952 output tokens and finishReason: STOP.

4) Side note: gemini-3-pro-preview returned 429 RESOURCE_EXHAUSTED on free tier for me

When I attempted Pro, the API returned HTTP 429 with RESOURCE_EXHAUSTED and quota metrics indicating free-tier limits effectively at 0 for that model in my case. (This may be unrelated to the Flash 3 output-stopping issue, but I’m including it for completeness.)

What I’m asking

I’m looking for feedback, information, anything that points to a solution or workaround, or confirmation that I should just give up on Gemini entirely and divert all development resources to Anthropic and OpenAI.

Specifically:

  1. Is there a known issue where Gemini 3 Flash (Thinking) self-terminates around ~3k output tokens with finishReason: STOP, even when maxOutputTokens is set very high?

  2. Is there any documented mechanism to discourage early stopping for long-form generation (SPECs, long reports), or is the correct approach to “continue” in multiple turns / multiple calls?

  3. If models.get reports outputTokenLimit: 65536, is it expected that a single generateContent call still cannot practically produce anywhere near that in one response?

  4. Are there recommended generation settings for long-form output (temperature, topP/topK, other flags) that actually allow multi-tens-of-thousands token outputs in a single call?

Because right now, in actual use, the “65k output limit” is effectively meaningless for document authoring. The model just stops.

Any guidance, confirmation of whether this is a known limitation/bug, or a recommended workaround would be appreciated.


Hi @Sean_Smith, could you please provide the usageMetadata from your JSON response? What thinkingLevel is set in your configuration?

I don’t have full logs of everything. I’m a vibe coder, not a full developer. My goal was to report enough that others can reproduce and look into matters. That said, I did retain some of the buffer outputs, but these seem to truncate data. My terminal usually outputs ‘everything’, but for some reason, when interacting with the Gemini REST API, sections of text would disappear from the responses/outputs. Since I didn’t produce full logs, what I have retained is incomplete.

But the data I have is:

….
It seems it won’t let me share the .text file directly. When I tried to copy-paste the text dump into chat, it wouldn’t let me post because of ‘too many links’. It seems this forum isn’t meant for this type of work. I can’t spend more time on this problem than I already have, especially not to find workarounds just to share basic data in ways that are easy in most other places (e.g. GitHub).

I don’t find any ‘values’ for usageMetadata in my dumps. I see things like:

PS C:\ai-work\dtd\testing> "usageMetadata:"
usageMetadata:"
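If it helps, next time I would dump the raw JSON to a file and pull usageMetadata out of it rather than relying on the terminal buffer; something like this sketch (assuming the response body was saved as response.json):

import json

with open("response.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Print only the token accounting and the termination reasons
print(json.dumps(data.get("usageMetadata", {}), indent=2))
for cand in data.get("candidates", []):
    print(cand.get("finishReason"))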

Hi @Sean_Smith, since you are not able to share the .text file here, please feel free to share it with me via direct message. Even if you don’t have complete logs, any raw output you share will help us identify the specific finishReason or metadata to reproduce this issue on our end.

I’m not finding an option on this forum to send you a direct message. Not on your profile page, not on my profile page, and not anywhere on this thread.

@Sean_Smith, click on my profile picture or username and a small pop-up will appear. Then click the Message button. I am attaching a screenshot for your reference.

@Sean_Smith, I’ve sent you a direct message. I would appreciate your confirmation and your time in reviewing and responding to it.

Could you please share the findings? I am experiencing the same issue and am also on the free tier. Is this a free-tier limit? Is it possible to adjust this on the Pro tier? Thank you

I shared them with Sonali_Kumari1 via DM. This forum provided no option to share directly in the thread.

My observation is that this output-limit behavior occurs with my AI Pro account when using OAuth in Gemini CLI and OpenCode, and with both free-tier and paid tier 1 AI Studio API keys in OpenCode.

I note this occurs both with normal personal-account OAuth and with OAuth for Antigravity.

I note that Gemini seems to default to overly concise outputs (100-line reports) even when directed to be comprehensive and verbose.

I’ve discovered some prompting workarounds that make Gemini kind of work: directing it to write a document plan/outline first and then write each section turn by turn (a rough sketch follows below). But even this ends up being inadequate for a properly constructed document.
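Roughly, the workaround looks like this (a sketch only; the prompts and the section count are placeholders, not what I use verbatim):

import os, requests

key = os.environ["GEMINI_API_KEY"]
url = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-3-flash-preview:generateContent")

def ask(history, text):
    # Append the user turn, send the whole history, then append the model turn back
    history.append({"role": "user", "parts": [{"text": text}]})
    resp = requests.post(url, params={"key": key},
                         json={"contents": history,
                               "generationConfig": {"maxOutputTokens": 8192}})
    resp.raise_for_status()
    reply = resp.json()["candidates"][0]["content"]
    history.append(reply)
    return "".join(p.get("text", "") for p in reply.get("parts", []))

history = []
ask(history, "Write a numbered outline for the project SPEC, section titles only.")
sections = [ask(history, f"Now write section {i} in full, as verbatim Markdown.")
            for i in range(1, 6)]  # section count is a placeholder
spec = "\n\n".join(sections)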

At the same time, I’ve been noticing that Gemini misbehaves when given tools and permissions. I tell it that the session is only for research, discovery, and analysis, and that it must not write to existing files or produce a rewritten final notice, yet in one session it still overwrote the existing notice with revisions, and in another it wrote a completely new final notice.

Gemini also seems to struggle to properly use tools, like MCP servers. It makes mistakes other models do not.

In short, Gemini 3 is not well behaved with its tool use, following restrictions, understanding and complying with user intent, and staying on-task.

The write output limits, combined with these problems, make it difficult to work with Gemini. Still, Gemini produces some interesting outputs.

I think the problem might be that the system instructions used with OAuth via OpenCode differ from those in the web app: perhaps Gemini in Gemini CLI and OpenCode is ‘uninstructed’ except for whatever the user defines. This is just a hypothesis for now, but it’s my next research task (see the sketch below).
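The test I have in mind is to pass an explicit system instruction over the REST API and compare verbosity with and without it. The systemInstruction field name below is my reading of the v1beta API, so treat it as an assumption:

import os, requests

key = os.environ["GEMINI_API_KEY"]
url = ("https://generativelanguage.googleapis.com/v1beta/"
       "models/gemini-3-flash-preview:generateContent")

body = {
    # systemInstruction is my assumption for supplying system-level guidance over REST
    "systemInstruction": {"parts": [{"text":
        "You are a documentation author. Be exhaustive and verbose; never summarize or truncate."}]},
    "contents": [{"role": "user", "parts": [{"text": "Write the full project SPEC."}]}],
    "generationConfig": {"maxOutputTokens": 60000},
}

resp = requests.post(url, params={"key": key}, json=body)
resp.raise_for_status()
print(resp.json()["candidates"][0].get("finishReason"))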

I’d like to make better use of Gemini, but it really does seem like Google has intentionally used a nerf hammer on the model to prevent it being used to the fullest. Which makes me wonder if it’s a complete waste of time trying to develop for Gemini 3 models at this time.