Potato
August 19, 2025, 4:45am
Hi, I’m using the Gemini 2.5 Pro model with OpenAI compatibility for my project. The OpenAI Node SDK version I am using is 5.12.2.
I am trying to use the usage object to approximate cost per request but I am seeing some odd behavior.
Here is an example usage object from a chat completion response:
usage: { completion_tokens: 102, prompt_tokens: 758, total_tokens: 1725 },
I assume completion_tokens doesn’t include the thinking tokens. It doesn’t make sense to me that completion_tokens + prompt_tokens != total_tokens.
Another question: how can I get the information on cached_tokens?
Here is an example usage when I run with GPT-5 with low reasoning:
usage: {
  prompt_tokens: 1486,
  completion_tokens: 651,
  total_tokens: 2137,
  prompt_tokens_details: { cached_tokens: 1408, audio_tokens: 0 },
  completion_tokens_details: {
    reasoning_tokens: 512,
    audio_tokens: 0,
    accepted_prediction_tokens: 0,
    rejected_prediction_tokens: 0
  }
}
The cached token information is inside prompt_tokens_details, but it is missing from the usage object returned by the Gemini model.
Would I have to switch to the Google Gemini Node SDK in order to get a more accurate usage object?
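For reference, this is roughly how I am estimating cost right now (a Python sketch of the calculation; the per-million-token prices are placeholders, not real Gemini pricing):

# Rough per-request cost estimate from a chat completion usage object.
# NOTE: the prices below are placeholders, not actual Gemini 2.5 Pro pricing.
INPUT_PRICE_PER_M = 1.25    # USD per 1M prompt tokens (placeholder)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (placeholder)

def approximate_cost(usage):
    prompt = usage.prompt_tokens
    # completion_tokens appears to exclude thinking tokens, so treat
    # everything that is not prompt as billable output.
    output = usage.total_tokens - prompt
    return (prompt * INPUT_PRICE_PER_M + output * OUTPUT_PRICE_PER_M) / 1_000_000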
Hello,
I would recommend going through the count tokens doc for detailed information on this topic.
To answer your questions:
All tokens processed in a query are included in total_tokens, which covers input, output, and thinking tokens.
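So, staying on the OpenAI compatibility layer, you can estimate the thinking tokens yourself from the usage object (a rough sketch, assuming response is the chat completion from your existing call):

usage = response.usage
# total_tokens includes prompt, output and thinking tokens, so the
# remainder is an approximate thinking-token count.
thinking_tokens = usage.total_tokens - usage.prompt_tokens - usage.completion_tokens
print("approximate thinking tokens:", thinking_tokens)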
To get the token count with the Google SDK, you can refer to the count tokens doc. Code:
from google import genai

client = genai.Client()
prompt = "The quick brown fox jumps over the lazy dog."

# Count tokens using the new client method.
total_tokens = client.models.count_tokens(
    model="gemini-2.0-flash", contents=prompt
)
print("total_tokens: ", total_tokens)
# ( e.g., total_tokens: 10 )

response = client.models.generate_content(
    model="gemini-2.0-flash", contents=prompt
)

# The usage_metadata provides detailed token counts.
print(response.usage_metadata)
# ( e.g., prompt_token_count: 11, candidates_token_count: 73, total_token_count: 84 )
If you print response.usage_metadata, you will get information about cached tokens as well.
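For example (field names as in recent versions of the google-genai Python SDK; either field may be None when it does not apply):

# Cached and thinking tokens are reported as separate fields.
print(response.usage_metadata.cached_content_token_count)
print(response.usage_metadata.thoughts_token_count)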
To get the token count with OpenAI compatibility, you can use response.usage. Example code below:
from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain to me how AI works"}
    ]
)

print(response.usage)
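If you want to check for cached tokens on the compatibility layer, you can read prompt_tokens_details defensively, since the field may simply be omitted when nothing was cached (a sketch, not guaranteed behavior of the compatibility layer):

usage = response.usage
details = getattr(usage, "prompt_tokens_details", None)
cached_tokens = getattr(details, "cached_tokens", 0) if details is not None else 0
print("cached tokens:", cached_tokens)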
Potato
August 21, 2025, 2:57pm
Hi Lalit Kumar,
I am not sure I am understanding your comment.
This is the object when I print out response.usage
{completion_tokens: 102, prompt_tokens: 758, total_tokens: 1725}
The model I used was gemini-2.5-pro with the Node.js SDK and OpenAI compatibility.
The input tokens plus output tokens do not equal the total tokens. That is not a blocker for me, because I can get the real output token count by subtracting prompt_tokens from total_tokens.
I was reading the link you shared, Understand and count tokens | Gemini API | Google AI for Developers.
None of the example responses include cached_content_token_count. Does this mean the field is only returned when cached_content_token_count > 0? Does the same apply to OpenAI compatibility?
Hello,
To understand context caching token counts in detail, you can go through the context caching docs. There you can find a Python example as well.
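A minimal sketch of explicit caching with the google-genai Python SDK, adapted from that doc (the file name, model, and TTL here are illustrative, and the cached contents must meet the model's minimum token count for caching):

from google import genai
from google.genai import types

client = genai.Client()

# Illustrative large context; caching only applies above a minimum size.
long_document_text = open("large_doc.txt").read()

# Create an explicit cache from the large context.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached document.",
        contents=[long_document_text],
        ttl="300s",
    ),
)

# Reference the cache in a normal generate_content call.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the document.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

# Cached tokens are reported separately in usage_metadata.
print(response.usage_metadata.cached_content_token_count)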