Token counting mismatch between AI Studio Playground and API usageMetadata when using Function Calling

I am building a service using the Gemini API and its Function Calling feature.
While testing token usage, I noticed that the token counts shown in the AI Studio Playground
do not match the usageMetadata values returned from the API.
This makes it unclear which values are actually used for billing.

Here is the behavior I am seeing:

  1. In AI Studio Playground

    • I send a user query.
    • The model returns a function call JSON.
    • I execute the function separately and paste the function result (JSON) back into the Playground.
    • The larger the function result JSON becomes, the more the Playground’s “Output Tokens” (under “Token Usage”) and “Output token cost” (under “Cost Estimation”) increase.
    • This suggests that the function result is being counted as output tokens.
  2. In the actual API (generateContent)

    • Step 1: Model generates the function call JSON → counted as candidatesTokenCount (expected).
    • Step 2: I send the function result JSON back as input.
    • In this step, increasing the size of the function result only increases the promptTokenCount, not the output count.
    • This matches the billing model as I understand it:
      function results should be counted as input tokens (see the sketch after this list).
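
For reference, here is a minimal sketch of this two-step flow, assuming the Python google-genai SDK; the model name and the hard-coded member list are placeholders, not my production code. It prints usageMetadata after each step so the two counts can be compared directly:

# Minimal sketch of the two-step Function Calling flow, assuming the
# google-genai Python SDK (pip install google-genai). The model name and
# the hard-coded member list are placeholders, not my real setup.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

get_team_members_decl = {
    "name": "getTeamMembers",
    "description": "Returns the list of members in a team",
    "parameters": {
        "type": "object",
        "properties": {"teamName": {"type": "string"}},
        "required": ["teamName"],
    },
}
config = types.GenerateContentConfig(
    tools=[types.Tool(function_declarations=[get_team_members_decl])]
)

prompt = (
    "What is the number of members in Team 'Red'? "
    "Please fetch the member list using the function and tell me only the final count."
)

# Step 1: the model responds with a functionCall part.
first = client.models.generate_content(
    model="gemini-2.0-flash", contents=prompt, config=config
)
print("step 1 usage:", first.usage_metadata)  # function call -> candidatesTokenCount
fn_call = first.candidates[0].content.parts[0].function_call

# Execute the function locally (placeholder result).
result = {"teamName": fn_call.args["teamName"], "members": ["Alice", "Bob", "Carol"]}

# Step 2: send the function result back as a functionResponse part.
history = [
    types.Content(role="user", parts=[types.Part.from_text(text=prompt)]),
    first.candidates[0].content,  # the model's functionCall turn
    types.Content(
        role="user",
        parts=[types.Part.from_function_response(name=fn_call.name, response=result)],
    ),
]
second = client.models.generate_content(
    model="gemini-2.0-flash", contents=history, config=config
)
print("step 2 usage:", second.usage_metadata)  # function result -> promptTokenCount
print("answer:", second.text)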

Based on the documentation, I believe:

  • Everything passed into a generateContent call (system_instruction, tools, history, and function results)
    should count as input tokens (promptTokenCount).
  • Only model-generated content should count as output tokens (candidatesTokenCount).

Because the Playground attributes more tokens to “Output” and increases the “Estimated Cost” when I enlarge the function result, its behavior seems inconsistent with the API’s accounting.
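
As a rough cross-check on the input side, countTokens shows how many tokens a function result of this size contributes; a sketch, assuming the google-genai Python SDK. Here the result JSON is counted as plain text, which only approximates how a functionResponse part is tokenized, and the synthetic member list is a placeholder:

# Rough cross-check: how many tokens does a large function result contribute?
# Assumes the google-genai Python SDK; counting the JSON as plain text is an
# approximation of how a functionResponse part is tokenized.
import json

from google import genai

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

fn_result = {"teamName": "Red", "members": [f"member_{i:04d}" for i in range(1000)]}
count = client.models.count_tokens(
    model="gemini-2.0-flash",  # placeholder model name
    contents=json.dumps(fn_result),
)
print(count.total_tokens)  # scales with the size of the function result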


My questions:

  1. Which values are authoritative for billing:
    Playground’s “Estimated Cost” or the API’s usageMetadata?

  2. In Function Calling:
    Is the function result JSON always counted as input tokens (promptTokenCount) in the next model call?

  3. Is the Playground UI misclassifying or double-counting tokens, especially across multi-step Function Calling flows?

  4. If this mismatch is unintended, is it a known issue?

I can provide full logs and screenshots, and I have also included a simplified, reproducible scenario below.


Optional: Detailed Reproduction Steps (for anyone who wants deeper context)

To test token behavior more clearly, I prepared a minimal setup:

Function definition

A function that takes a team name and returns a list of members:

  • Input: team name (e.g., "Red")
  • Output: JSON array listing team members

Example (simplified):

{
  "name": "getTeamMembers",
  "description": "Returns the list of members in a team",
  "parameters": {
    "type": "object",
    "properties": {
      "teamName": { "type": "string" }
    },
    "required": ["teamName"]
  }
}
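
For completeness, the local implementation behind this declaration is trivial; a sketch (the helper name, the synthetic member names, and the member_count knob are mine, added so the result size can be varied for the experiment below):

# Sketch of the local function behind the getTeamMembers declaration.
# member_count is a knob added to vary the size of the returned JSON.
def get_team_members(team_name: str, member_count: int = 10) -> dict:
    """Return a JSON-serializable result listing member_count synthetic members."""
    return {
        "teamName": team_name,
        "members": [f"member_{i:04d}" for i in range(member_count)],
    }

# Example: the payload that gets sent back to the model as the function result.
print(get_team_members("Red", member_count=10))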

Prompt used

What is the number of members in Team 'Red'? 
Please fetch the member list using the function and tell me only the final count.

Experiment

I executed the Function Calling flow twice (a sketch of how I scripted the comparison follows the two cases):

  1. Case A: The function returns a small list (~10 members)

  2. Case B: The function returns a large list (1000 members)
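
A sketch of how the comparison can be scripted, assuming the google-genai Python SDK; the model name, the synthetic member lists, and the helper name are placeholders. To isolate the variable, it hand-writes the model’s functionCall turn from step 1, so the only thing that changes between the two cases is the size of the function result:

# Sketch: compare promptTokenCount for a small vs. large function result,
# assuming the google-genai Python SDK. Member lists and model name are placeholders.
from google import genai
from google.genai import types

client = genai.Client()  # assumes GEMINI_API_KEY is set in the environment

DECLARATION = {
    "name": "getTeamMembers",
    "description": "Returns the list of members in a team",
    "parameters": {
        "type": "object",
        "properties": {"teamName": {"type": "string"}},
        "required": ["teamName"],
    },
}
CONFIG = types.GenerateContentConfig(
    tools=[types.Tool(function_declarations=[DECLARATION])]
)
PROMPT = (
    "What is the number of members in Team 'Red'? "
    "Please fetch the member list using the function and tell me only the final count."
)


def second_turn_usage(member_count: int):
    """Send a function result with member_count members and return usageMetadata."""
    result = {
        "teamName": "Red",
        "members": [f"member_{i:04d}" for i in range(member_count)],
    }
    history = [
        types.Content(role="user", parts=[types.Part.from_text(text=PROMPT)]),
        # Hand-written stand-in for the model's functionCall turn from step 1.
        types.Content(
            role="model",
            parts=[
                types.Part.from_function_call(
                    name="getTeamMembers", args={"teamName": "Red"}
                )
            ],
        ),
        types.Content(
            role="user",
            parts=[
                types.Part.from_function_response(
                    name="getTeamMembers", response=result
                )
            ],
        ),
    ]
    response = client.models.generate_content(
        model="gemini-2.0-flash", contents=history, config=CONFIG
    )
    return response.usage_metadata


print("Case A (10 members):  ", second_turn_usage(10))
print("Case B (1000 members):", second_turn_usage(1000))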

Observed results

  • Playground: Case B showed a noticeably larger “Output Tokens” value and a higher “Output token cost” than Case A, even though the extra content was the pasted function result.
  • API: the same increase from Case A to Case B appeared only in promptTokenCount; candidatesTokenCount stayed roughly the same.

This mismatch is why I’m trying to confirm the intended behavior.

Thank you very much!
Happy to provide more data if needed.

Hello,

Thank you for using the forum. We were able to reproduce the issue with the help of the instructions you provided, and we will pass this information on to the Gemini development team.