Hi Google AI team,
I am using Gemini on Vertex AI with Flex PayGo via the global endpoint and these request headers:
X-Vertex-AI-LLM-Request-Type: shared
X-Vertex-AI-LLM-Shared-Request-Type: flex
I created an explicit context cache containing audio content and then referenced it in a generate_content call from the same Flex client. The API accepts the request, and the response's usage metadata reports:
traffic_type: ON_DEMAND_FLEX
cached_content_token_count:
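A sketch of how this test can be reproduced, assuming the google-genai Python SDK. The function name `run_repro`, the prompt text, the one-hour TTL, and the default MIME type are illustrative assumptions; project, model, and audio URI are supplied by the caller and are not the exact values from the test.

```python
# Flex PayGo headers, exactly as listed above.
FLEX_HEADERS = {
    "X-Vertex-AI-LLM-Request-Type": "shared",
    "X-Vertex-AI-LLM-Shared-Request-Type": "flex",
}


def run_repro(project: str, model: str, audio_uri: str, mime_type: str = "audio/mpeg"):
    """Create an explicit audio cache, then reference it from a Flex request.

    Hypothetical helper: returns (traffic_type, cached_content_token_count).
    """
    # Imported here so this module loads even without the SDK installed.
    from google import genai
    from google.genai import types

    # One client for both calls, so the Flex headers are sent on
    # cache creation as well as on generate_content.
    client = genai.Client(
        vertexai=True,
        project=project,
        location="global",
        http_options=types.HttpOptions(headers=FLEX_HEADERS),
    )

    # Explicit context cache holding the audio clip (TTL is an assumption).
    cache = client.caches.create(
        model=model,
        config=types.CreateCachedContentConfig(
            contents=[
                types.Content(
                    role="user",
                    parts=[types.Part.from_uri(file_uri=audio_uri, mime_type=mime_type)],
                )
            ],
            ttl="3600s",
        ),
    )

    # Reference the cached audio from a Flex request.
    response = client.models.generate_content(
        model=model,
        contents="Classify this audio clip.",
        config=types.GenerateContentConfig(cached_content=cache.name),
    )

    usage = response.usage_metadata
    return usage.traffic_type, usage.cached_content_token_count
```

Calling `run_repro(...)` with a real project, a model that supports explicit caching, and a `gs://` audio URI should return the two fields quoted above.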
My question is about billing:
When using explicit context caching with a Flex PayGo client/header, is the cache creation billed at Flex PayGo pricing or standard PayGo input pricing?
Related question: when a cached content resource is referenced from a Flex PayGo request, do cached-token read discounts combine with Flex pricing, or is explicit-cache pricing applied independently?
Use case: an audio pipeline in which every clip is first classified and around 50% of clips are then transcribed, so roughly half of the clips are sent to the model twice. We want to know whether explicit caching can preserve Flex’s 50% audio input discount, or whether we should rely only on implicit caching.
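To show why the answer matters for this pipeline, here is a rough per-clip cost sketch in Python. Everything in it is an assumption for illustration: costs are expressed relative to one standard-priced pass over the clip's audio tokens, `CACHE_READ` is a placeholder and not a published rate, and cache storage charges and non-audio prompt tokens are ignored.

```python
def relative_input_cost(create_rate: float, read_rate: float,
                        transcribe_fraction: float = 0.5) -> float:
    """Audio input cost per clip, in units of one standard-priced pass.

    create_rate: price multiplier for the one-time cache creation
                 (0.0 when no explicit cache is used).
    read_rate:   price multiplier for each request that reads the audio.
    """
    # Every clip is classified once; a fraction is also transcribed.
    reads = 1.0 + transcribe_fraction
    return create_rate + reads * read_rate


FLEX = 0.5        # Flex's 50% audio input discount, from the use case above
CACHE_READ = 0.25 # PLACEHOLDER cached-token multiplier, not a real price

# No explicit cache: each pass is billed at the Flex rate.
no_cache = relative_input_cost(0.0, FLEX)

# Hypothesis A: Flex applies to cache creation and stacks with cached reads.
cache_combined = relative_input_cost(FLEX, FLEX * CACHE_READ)

# Hypothesis B: explicit caching is priced independently of Flex.
cache_independent = relative_input_cost(1.0, CACHE_READ)

print(no_cache, cache_combined, cache_independent)
```

Under these placeholder numbers, explicit caching only beats plain Flex requests if the discounts combine, which is why the two billing questions above decide the design.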
Thanks.