Token count with cached_content is the same as without cached_content

Problem Description
I noticed a discrepancy between the total_token_count value returned by the API and the expected calculation based on usage metadata, particularly when using context caching (CreateCachedContent).

Expected Behavior
I would have expected total_token_count to be calculated as the sum of non-cached tokens (prompt_token_count - cached_content_token_count) and generated tokens (candidates_token_count). This would reduce the total token count for requests using cache, reflecting a cost savings.

Current Behavior
The total_token_count value instead appears to be calculated as the full sum of prompt_token_count and candidates_token_count, ignoring the cached portion. This is misleading from a cost perspective, as it doesn’t reflect the benefit of the cache.
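
To make the two calculations explicit, here is a minimal sketch of my own (assuming a usage_metadata object with the fields shown in the dumps below):

def expected_total(usage) -> int:
    # Non-cached input tokens plus generated tokens.
    cached = usage.cached_content_token_count or 0
    return (usage.prompt_token_count - cached) + usage.candidates_token_count

def observed_total(usage) -> int:
    # What the API actually reports: full prompt plus generated tokens.
    return usage.prompt_token_count + usage.candidates_token_count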

Reproduction Steps
1. Make a first API call with a large prompt, without using the cache.
2. Make a second call with the same prompt, ensuring the shared content is served from the cache.
3. Analyze the usage metadata from the second call (a sketch of these steps follows below).
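
For reference, a minimal end-to-end sketch of these steps with the google-genai SDK (the model name, TTL, and prompt strings are placeholders, not my exact setup):

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key is configured in the environment
model = "gemini-1.5-flash-001"  # placeholder; any model that supports context caching
large_context = "<large shared context goes here>"
dynamic_prompt = "<dynamic part of the prompt>"

# First call: no cache, the full prompt is counted as regular input tokens.
first = client.models.generate_content(
    model=model,
    contents=[large_context, dynamic_prompt],
)
print("without cache:", first.usage_metadata)

# Create a cache for the large, shared part of the prompt.
# Note: context caching requires a model-specific minimum number of cached tokens.
cache = client.caches.create(
    model=model,
    config=types.CreateCachedContentConfig(contents=[large_context], ttl="300s"),
)

# Second call: same prompt, but the shared part is served from the cache.
second = client.models.generate_content(
    model=model,
    contents=[dynamic_prompt],
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print("with cache:", second.usage_metadata)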

Technical Details

API Call (cached):

response = self.client.models.generate_content(
    model=self.model,
    contents=[dynamic_prompt],
    config=types.GenerateContentConfig(
        cached_content=cache_id,
        **self.model_config,
    ),
)

Example of USAGE METADATA (without cache, i.e. without the cached_content=cache_id line in the code):
cache_tokens_details=None
cached_content_token_count=None
candidates_token_count=194
candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=194)]
prompt_token_count=5202
prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=5202)]
thoughts_token_count=None
tool_use_prompt_token_count=None
tool_use_prompt_tokens_details=None
total_token_count=5396
traffic_type=None

Example of USAGE METADATA (with cache):
cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=4115)]
cached_content_token_count=4115
candidates_token_count=199
candidates_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=199)]
prompt_token_count=5332
prompt_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=5332)]
thoughts_token_count=None
tool_use_prompt_token_count=None
tool_use_prompt_tokens_details=None
total_token_count=5531
traffic_type=None

Expected vs. Actual Count:

Expected Count: (5332 - 4115) + 199 = 1217 + 199 = 1416

Actual Count: 5332 + 199 = 5531
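
The same check in code, using the values from the cached call's metadata above:

prompt_tokens = 5332
cached_tokens = 4115
candidate_tokens = 199

expected = (prompt_tokens - cached_tokens) + candidate_tokens  # 1217 + 199 = 1416
actual = prompt_tokens + candidate_tokens                      # 5332 + 199 = 5531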

Impact
This behavior makes it difficult to accurately estimate the costs of API calls that use the cache, as token count optimization is not apparent in the usage data. This can lead to incorrect cost calculations and potential resource waste.

Hi @Marco_Fosci,

Thanks for the reproduction steps. I will try to reproduce this issue and troubleshoot what’s going on.

Hi @Marco_Fosci ,

The usage_metadata can be confusing but it’s really just for tracking purposes, not a reflection of your final billable tokens.

You’re also right that the bill will be lower with context caching, but the expected calculation you described is not how billing works. The actual billing calculation is more complex and depends on more parameters. Please refer to https://cloud.google.com/blog/products/ai-machine-learning/vertex-ai-context-caching for more details on price optimization using context caching.

Context caching does not reduce the prompt token count. The prompt token count reflects the total number of input tokens for the model, regardless of whether they originate from the cache.

The main benefit of context caching is that the cached tokens are priced at a lower rate, not that it decreases the overall prompt token count.
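
As a rough illustration of where the savings come from (the rates below are placeholders, not real prices; please check the official pricing pages for your model):

# Hypothetical per-token rates, for illustration only.
INPUT_RATE = 1.00    # cost per regular input token (placeholder)
CACHED_RATE = 0.25   # cached input tokens are billed at a discounted rate (placeholder)
OUTPUT_RATE = 4.00   # cost per output token (placeholder)

def estimated_cost(prompt_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    # Cached tokens stay inside prompt_token_count but are billed at the cheaper rate.
    # Cache storage is also billed per token-hour; that part is omitted here.
    non_cached = prompt_tokens - cached_tokens
    return (non_cached * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE)

# With the numbers from the cached call above, this comes out lower than
# billing all 5332 prompt tokens at the regular input rate.
print(estimated_cost(5332, 4115, 199))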