Question about "usage" in OpenAI compatibility

Hi, I'm using the Gemini 2.5 Pro model with OpenAI compatibility for my project. The OpenAI Node SDK version I am using is 5.12.2.

I am trying to use the usage object to approximate the cost per request, but I am seeing some odd behavior.

Here is an example usage object from a chat completion response:

 usage: { completion_tokens: 102, prompt_tokens: 758, total_tokens: 1725 },

I assume completion_tokens doesn't include the thinking tokens. It doesn't make sense to me that completion_tokens + prompt_tokens != total_tokens (758 + 102 = 860, which leaves 865 tokens unaccounted for).
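For reference, this is roughly how I am estimating the per-request cost (a quick Python sketch; the prices are made-up placeholders, not real rates, and my actual project uses the Node SDK). I am assuming everything that is not a prompt token should be billed as output, including the thinking tokens:

INPUT_PRICE_PER_1M = 1.00   # placeholder $ per 1M input tokens, not a real rate
OUTPUT_PRICE_PER_1M = 4.00  # placeholder $ per 1M output tokens, not a real rate

def estimate_cost(usage: dict) -> float:
    # Treat everything that is not part of the prompt as output,
    # since completion_tokens appears to exclude the thinking tokens.
    input_tokens = usage["prompt_tokens"]
    output_tokens = usage["total_tokens"] - usage["prompt_tokens"]
    return (input_tokens * INPUT_PRICE_PER_1M
            + output_tokens * OUTPUT_PRICE_PER_1M) / 1_000_000

print(estimate_cost({"completion_tokens": 102, "prompt_tokens": 758, "total_tokens": 1725}))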

Another question: how can I get information on cached_tokens?
Here is an example usage object when I run GPT-5 with low reasoning:

usage: {
    prompt_tokens: 1486,
    completion_tokens: 651,
    total_tokens: 2137,
    prompt_tokens_details: { cached_tokens: 1408, audio_tokens: 0 },
    completion_tokens_details: {
      reasoning_tokens: 512,
      audio_tokens: 0,
      accepted_prediction_tokens: 0,
      rejected_prediction_tokens: 0
    }
  }

The cached token information is inside prompt_tokens_details, but that field is missing from the usage object returned by the Gemini model.

Would I have to switch to the Google Gemini Node SDK in order to get a more accurate usage object?


Hello,

I would recommend going through the count tokens doc for detailed information on this topic.

To answer your questions:

  1. All tokens processed in a query are included in the total token count, which covers input, output, and thinking tokens (see the note after the code examples below for how to back out the thinking tokens).

  2. To get the token count with the Google SDK, you can refer to the count tokens doc. Code:

from google import genai

client = genai.Client()
prompt = "The quick brown fox jumps over the lazy dog."

# Count tokens using the new client method.
total_tokens = client.models.count_tokens(
    model="gemini-2.0-flash", contents=prompt
)
print("total_tokens: ", total_tokens)
# ( e.g., total_tokens: 10 )

response = client.models.generate_content(
    model="gemini-2.0-flash", contents=prompt
)

# The usage_metadata provides detailed token counts.
print(response.usage_metadata)
# ( e.g., prompt_token_count: 11, candidates_token_count: 73, total_token_count: 84 )
  3. If you print response.usage_metadata, you will see the cached token information as well (the cached_content_token_count field).

  4. To get the token count with OpenAI compatibility, you can use response.usage. Example code below:

from openai import OpenAI

client = OpenAI(
    api_key="GEMINI_API_KEY",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "Explain to me how AI works"
        }
    ]
)

print(response.usage)
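Regarding your first observation: in the usage object you posted, the thinking tokens are not broken out separately, but since the total token count includes them (point 1 above), you can back them out yourself. A rough sketch, continuing from the example above (the arithmetic is an assumption based on point 1, not a documented guarantee):

# Back out the implied thinking tokens from the OpenAI-style usage object,
# assuming total_tokens = prompt_tokens + completion_tokens + thinking tokens.
usage = response.usage
thinking_tokens = usage.total_tokens - usage.prompt_tokens - usage.completion_tokens
output_tokens = usage.total_tokens - usage.prompt_tokens  # completion + thinking

print("prompt tokens:", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
print("implied thinking tokens:", thinking_tokens)

With the native Google GenAI SDK, response.usage_metadata should also report the thinking tokens directly via the thoughts_token_count field, so you do not need to derive them there.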

Hi Lalit Kumar,
I am not sure I understand your comment.
This is the object I get when I print response.usage:

{completion_tokens: 102, prompt_tokens: 758, total_tokens: 1725}

The model I used was gemini-2.5-pro with the Node.js SDK and OpenAI compatibility.

The input tokens + output tokens do not equal the total tokens. That is not a blocker for me, because I can get the real output token count by doing total minus prompt (1725 - 758 = 967 output tokens, which presumably includes the thinking tokens that completion_tokens leaves out).

I was reading the link you shared: Understand and count tokens | Gemini API | Google AI for Developers.

None of the example responses included cached_content_token_count. Does this mean the field is only returned when cached_content_token_count > 0? Does the same apply to OpenAI compatibility?

Hello,

To understand the context caching token count in detail, you can go through the context caching docs, where you can find a Python example as well.
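As a rough sketch of how that looks with the Google GenAI SDK (the file name and model here are just examples, and note that the cached content has to meet the model's minimum token count before a cache can be created):

from google import genai
from google.genai import types

client = genai.Client()

# Hypothetical large document; short prompts are below the caching minimum.
long_document = open("large_report.txt").read()

# Create an explicit cache holding the document.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction="You answer questions about the attached report.",
        contents=[long_document],
        ttl="300s",
    ),
)

# Reference the cache in a request and inspect the token breakdown.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the key findings.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)

# usage_metadata should now include cached_content_token_count
# alongside prompt_token_count and total_token_count.
print(response.usage_metadata)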