How does AI Studio calculate "Token Usage" vs. Actual Context Window?

Hi everyone,

I’m trying to understand the exact mechanics of the “Token Usage” (Cost Estimation) panel in the Google AI Studio chat interface (specifically using Gemini 3.1 Pro Preview). I’ve run into some confusing discrepancies between what the UI displays and how the actual context window and stateless API requests seem to work.

Here are the objective facts from my recent tests:

Observation 1: The UI Input counter does not append conversation history.

  1. I sent an initial prompt of exactly 15 tokens. The model generated a 1,756-token output. The UI panel updated to show → Input tokens: 15, Output tokens: 1756.
  2. I then sent a follow-up prompt of exactly 2 tokens (“Continue”). The model replied with 1,580 tokens.
  3. After this second turn, the panel updated to show → Input tokens: 17 (just 15 + 2) and Output tokens: 3336 (1756 + 1580).
    This indicates that the Input tokens metric in the UI is just a cumulative sum of the raw text I manually typed, not the actual payload (history + new prompt) that a stateless API would have to receive on each turn.
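To make the discrepancy concrete, here is a minimal sketch (plain Python, using the token counts from my two turns above; the variable names are my own) contrasting the UI’s cumulative counter with the payload a stateless API would actually have to carry on the second turn:

```python
# Token counts observed in the two turns above: (user input, model output).
turns = [(15, 1756), (2, 1580)]

# What the UI panel appears to show: a running sum of typed input only.
ui_input_counter = sum(user for user, _ in turns)   # 15 + 2 = 17

# What a stateless API would actually receive on turn 2:
# the full history (turn-1 prompt + turn-1 reply) plus the new prompt.
history = turns[0][0] + turns[0][1]                 # 15 + 1756 = 1771
real_turn2_payload = history + turns[1][0]          # 1771 + 2 = 1773

print(ui_input_counter)    # 17
print(real_turn2_payload)  # 1773
```

So for turn 2 the UI reports 17 input tokens while the real request payload would be roughly 1,773 tokens, a two-orders-of-magnitude gap that only widens as the conversation grows.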

Observation 2: Massive Output tokens and crossing the 1M limit without errors.
In a much longer session, the panel showed Input tokens: ~355k and Output tokens: ~703k, for a Total tokens: ~1.05M. The UI progress bar turned red, indicating I had exceeded the 1,048,576-token maximum context window. Yet I was able to keep chatting without ever hitting a “context window exceeded” error.
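For reference, here is the arithmetic behind the red bar, a quick sanity check using the approximate figures from the panel:

```python
MAX_CONTEXT = 1_048_576       # advertised maximum context window
input_tokens = 355_000        # approximate panel reading
output_tokens = 703_000       # approximate panel reading

total = input_tokens + output_tokens
print(total, total > MAX_CONTEXT)   # 1058000 True
```

So by the panel’s own accounting the session is ~10k tokens past the limit, yet requests keep succeeding, which is what makes me doubt that this total reflects what is actually sent per request.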

The massive Output count seems to include not just the visible text but also the large hidden Chain-of-Thought (CoT) traces (the “Thoughts” feature).

Based on these observations, I have a few specific questions for the community or the dev team:

  1. How is the actual conversation history sent to the backend? Since the UI’s Input tokens counter clearly ignores previous AI outputs, what is the actual size of the payload being sent?
  2. How are CoT / “Thoughts” handled in the context history? Given that the historical Output tokens (including raw CoT) are huge, sending them all back would instantly blow up the 1M context limit. Does the backend completely discard the raw CoT after generation?
  3. Are thought summaries or digital signatures used? To maintain reasoning continuity without passing back hundreds of thousands of raw CoT tokens, does the frontend only pass back the visible text? Or does it pass back a lightweight “thought summary” or some kind of encrypted digital signature/state token to the backend?
  4. How can I track the real context size? For developers doing long-form work (e.g. writing novels), how can we accurately monitor the true context window usage per request and avoid silent truncation, given that this UI panel seems to act only as a cumulative billing ledger?
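For question 4, the workaround I’m currently considering is to rebuild the payload client-side and count it myself before each request. A minimal sketch (the `count_tokens` stub is a placeholder; in practice you would substitute the API’s own token-counting endpoint so the numbers match the server-side tokenizer, since a local heuristic will drift):

```python
MAX_CONTEXT = 1_048_576

def count_tokens(text: str) -> int:
    """Placeholder heuristic (~4 chars/token). Replace with the API's real
    token-counting call; a local estimate will not match the server exactly."""
    return max(1, len(text) // 4)

def payload_size(history: list[tuple[str, str]], new_prompt: str) -> int:
    """Tokens a stateless request would carry: full history + new prompt."""
    total = count_tokens(new_prompt)
    for user_msg, model_msg in history:
        total += count_tokens(user_msg) + count_tokens(model_msg)
    return total

def check_budget(history, new_prompt, reserve_for_output=8_192):
    """Return (tokens used, tokens remaining after reserving output room)."""
    used = payload_size(history, new_prompt)
    remaining = MAX_CONTEXT - used - reserve_for_output
    if remaining < 0:
        print(f"Over budget by {-remaining} tokens; trim or summarize history.")
    return used, remaining
```

The key point is that the number to watch is `payload_size` per request, not the cumulative sum the UI panel shows; but this only works if we know whether hidden CoT tokens are part of the replayed history, which is exactly what questions 2 and 3 are asking.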

Any insights into the actual engineering behind this UI and the API payload construction would be greatly appreciated!