Gemini 2.5 Computer Use is using more tokens than I expected

Hey! As the title suggests, I’m confused about this new model, and also about the number of tokens I’ve used while testing it.

I started experimenting with it and just set up my account today with tier-1 pricing, so everything is fresh.

In just the past few hours, with about 70 requests, I have already burned over 200k tokens for gemini-2.5-pro, and a much more modest 13k tokens for computer-use-preview.

Why are the gemini-2.5-pro tokens so high? My prompt is a short sentence, about 20 words. I do pass a screenshot each time, as outlined in the documentation, so that the model can iterate. But that shouldn’t produce a crazy-high token count, should it? I set MediaResolution to “High”, which the C# API says is 256 tokens.

Also, I am not exactly sure how this gets priced, though that’s a separate question. The pricing page gives a price per million tokens with two tiers, one for prompts under 200k tokens and one for prompts over 200k tokens, and I don’t understand how that works here. I thought computer-use prompts were limited to 128k input tokens?

TL;DR: I am just confused why the gemini-2.5-pro input tokens are over 200k. It does not feel like it should be this high, unless I really misunderstand the media resolution setting or am missing some best practices for keeping token counts down.

Thanks!
David

Some more details today:

This is the usage metadata from one run. And obviously, the history grows with each run, since the previous screenshots are still in the chat history.

{
  "usageMetadata": {
    "candidatesTokenCount": 44,
    "promptTokenCount": 2393,
    "promptTokensDetails": [
      {
        "modality": "TEXT",
        "tokenCount": 71
      },
      {
        "modality": "IMAGE",
        "tokenCount": 2322
      }
    ],
    "totalTokenCount": 2437
  }
}
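
To see where the tokens go, the `promptTokensDetails` breakdown can be split by modality. A minimal sketch, using the metadata above as a plain dict (field names exactly as the API returns them):

```python
# Usage metadata from the run above, as a plain dict.
usage = {
    "candidatesTokenCount": 44,
    "promptTokenCount": 2393,
    "promptTokensDetails": [
        {"modality": "TEXT", "tokenCount": 71},
        {"modality": "IMAGE", "tokenCount": 2322},
    ],
    "totalTokenCount": 2437,
}

# Group prompt tokens by modality to see what dominates the bill.
by_modality = {d["modality"]: d["tokenCount"] for d in usage["promptTokensDetails"]}
image_share = by_modality["IMAGE"] / usage["promptTokenCount"]

print(by_modality)           # {'TEXT': 71, 'IMAGE': 2322}
print(f"{image_share:.0%}")  # 97% — the screenshot is almost the entire prompt
```

So the short text prompt is basically free; nearly all of the input cost is the screenshot.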

I wonder if it is possible to remove the previous screenshot parts from the history… I assume this would have negative effects.

Also, this is a JPEG at quality 60, and the resolution is 1122 × 555 pixels. And I’ve set Media Resolution to Medium. 2322 tokens still feels very high. I set the global media resolution with the config, and I know it takes effect, since setting it to Low reduces the count further (though the model then stops interacting with elements correctly; the returned X/Y coordinates for mouse clicks are wrong).
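
For reference, setting the global media resolution in the request config looks roughly like this in the Python SDK (google-genai); I’m on the C# API, so treat the exact names as an assumption, but the C# config exposes the equivalent setting:

```python
from google import genai
from google.genai import types

# Assumed: google-genai Python SDK; reads GEMINI_API_KEY from the environment.
client = genai.Client()

config = types.GenerateContentConfig(
    # Applies to every image part in the request; LOW / MEDIUM / HIGH variants exist.
    media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM,
)
```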

So here’s the table from the docs, which is very confusing, because I don’t understand how an image can be just 64 tokens (Gemini 2.5 models, Low media resolution).

And then there’s the recommended-settings table, which shows that High media resolution should be ~1120 tokens at most (or is that for Gemini 3 models?). So how is it that Medium resolution is nearly doubling that, with a fairly low-resolution image and heavy JPEG compression…

I am just quite confused. I’d like to remove past screenshots from the history and let the model operate off the latest image plus its thought-process text parts. But I haven’t tested this. Is it common?

Sorry, this post is all over the place. My first question remains: how is the token usage so high for the image input, when the docs show that high-resolution images should be at most 1120 tokens? I have a lower-resolution image with Medium media resolution and heavy JPEG compression, and still see almost double that number…

Thanks,

I was able to figure out a solution and learned some more about the APIs along the way, so I will answer my own post with this reply:

First off, I didn’t know about tiles.

Image understanding | Gemini API | Google AI for Developers

So, for those who are in a similar boat as me:

  1. Images are split into tiles, and the cost shown on the website for processing an image with Low, Medium, or High resolution is counted per tile, not per image. Once I calculated the number of tiles × the per-tile token cost, things started making more sense.

  2. You totally can prune images from the history, and it will still work just fine. So for each request in the AI agent loop (after the first one), I run a function that removes all image attachments: any Part with InlineData that is an image, and for each FunctionResponse I also set its image part to null.

  3. This massively cut token usage. I also removed the JPEG compression and just send up PNG now, since the compression was not reducing costs; the cost is entirely based on the number of tiles (correct me if I am wrong).
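
My pruning function is in C#, but the idea is simple. Here is a minimal Python sketch, assuming the history is a list of turns whose parts are plain dicts (`inline_data` / `function_response` are stand-ins for the SDK’s Part fields):

```python
def prune_old_screenshots(history):
    """Strip image parts from every turn except the most recent one.

    Assumed shape: each turn is {"role": ..., "parts": [...]}; an image part
    carries {"inline_data": {"mime_type": "image/png", ...}}, and a function
    response part may carry a screenshot under {"function_response": {...}}.
    """
    for turn in history[:-1]:
        kept = []
        for part in turn["parts"]:
            mime = part.get("inline_data", {}).get("mime_type", "")
            if mime.startswith("image/"):
                continue  # drop the stale screenshot part entirely
            fr = part.get("function_response")
            if fr is not None:
                fr.pop("image", None)  # null out screenshots inside tool results
            kept.append(part)
        turn["parts"] = kept
    return history
```

The last turn keeps its screenshot, so the model still sees the current screen state while the text parts preserve its earlier reasoning.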

Some quick and dirty numbers: without pruning the history, I was adding over 1,000 tokens to my input each time I made a request, so 5 iterations could end up costing a grand total of 15,000 tokens:

1000 → 2000 → 3000 → 4000 → 5000

If you prune images and only send the most recent one, you can keep it at an (almost) fixed rate per request (the history still grows with user prompts and model responses):
1000 → 1000 → 1000 → 1000 → 1000
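
The arithmetic above as a quick sketch (1,000 image tokens per screenshot is just my rough per-request figure, not an exact cost):

```python
IMAGE_TOKENS = 1000  # rough per-screenshot cost observed in my runs

# Without pruning, every past screenshot rides along in each new request.
growing = [IMAGE_TOKENS * n for n in range(1, 6)]  # [1000, 2000, 3000, 4000, 5000]

# With pruning, only the latest screenshot is sent each time.
pruned = [IMAGE_TOKENS] * 5                        # [1000, 1000, 1000, 1000, 1000]

print(sum(growing))  # 15000 image tokens across 5 iterations
print(sum(pruned))   # 5000 — a 3x saving, and the gap widens with more steps
```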

This worked for me, and might work for someone else!