Hi - I wonder if anyone Google-side can help with a roadblock here. The Gemini API documentation indicates that response_schema is rendered into the prompt, so it counts toward input tokens and has an associated per-token cost, which makes perfect sense. We need to model the cost of each API call accurately, which we are doing (assuming no cached tokens and no function calls, for simplicity) as prompt_token_count * input_token_cost + candidates_token_count * output_token_cost.
The problem is that the input tokens added when a response_schema is specified in the generate_content API call do not seem to be counted anywhere. Does this mean we are not being charged for the response_schema tokens? And if we are being charged, the generate_content response does not appear to include these tokens in usage_metadata, so there is no way to know how much we are being billed for the call.
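For concreteness, this is roughly how we compute the cost of a call (the per-token prices below are placeholders, not actual Gemini rates):

# Placeholder prices (USD per token) - substitute the real rates for your model.
INPUT_TOKEN_COST = 0.10 / 1_000_000
OUTPUT_TOKEN_COST = 0.40 / 1_000_000

def call_cost(usage_metadata) -> float:
    # Assumes no cached tokens and no function calls, as noted above.
    return (usage_metadata.prompt_token_count * INPUT_TOKEN_COST
            + usage_metadata.candidates_token_count * OUTPUT_TOKEN_COST)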
Example:
from pydantic import BaseModel, Field
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

class TestSchema(BaseModel):
    output: str = Field(description="model output")
    test1: str = Field(description="always output: 'one'")
    test2: str = Field(description="always output: 'two'")
    test3: str = Field(description="always output: 'three'")
resp_with_schema = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    config=types.GenerateContentConfig(
        temperature=0,
        candidate_count=1,
        thinking_config=types.ThinkingConfig(thinking_budget=0),
        response_mime_type="application/json",
        response_schema=TestSchema,
    ),
    contents=["Say hello."],
)
print(resp_with_schema.usage_metadata, resp_with_schema.candidates[0].content.parts[0].text)
resp_without_schema = client.models.generate_content(
    model="gemini-2.5-flash-lite",
    config=types.GenerateContentConfig(
        temperature=0,
        candidate_count=1,
        thinking_config=types.ThinkingConfig(thinking_budget=0),
    ),
    contents=["Say hello."],
)
print(resp_without_schema.usage_metadata, resp_without_schema.candidates[0].content.parts[0].text)
You’ll see this test code runs fine - the schema is respected when response_schema is specified. But the input side of usage_metadata is identical in both cases (prompt_token_count = 4), which clearly doesn’t include the response schema tokens.
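To make the discrepancy explicit:

# Both calls report the same input-side usage, so the schema tokens
# are not reflected anywhere in usage_metadata.
assert (resp_with_schema.usage_metadata.prompt_token_count
        == resp_without_schema.usage_metadata.prompt_token_count)  # both 4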
Is this an API bug? In any case, how do we know the total input token count, including the response_schema tokens, so we can multiply by the input token price to compute the correct cost?
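The only workaround I can think of is to count tokens on the serialized schema ourselves, but that is just a rough guess at the overhead, since we have no visibility into how the backend actually renders the schema into the prompt:

import json

# Rough, hypothetical estimate: tokenize the JSON schema text directly.
# The backend may render the schema differently before injecting it,
# so this is a guess, not the billed amount.
schema_text = json.dumps(TestSchema.model_json_schema())
schema_tokens = client.models.count_tokens(
    model="gemini-2.5-flash-lite",
    contents=[schema_text],
)
print(schema_tokens.total_tokens)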
Thank you!
-Adrian