I am trying to understand the token count when using images with different resolutions.
I am aware of the documented behavior, and specifically:
- With Gemini 2.0, image inputs with both dimensions <=384 pixels are counted as 258 tokens. Images larger in one or both dimensions are cropped and scaled as needed into tiles of 768x768 pixels, each counted as 258 tokens. Prior to Gemini 2.0, images used a fixed 258 tokens.
But my results do not match the documented behavior, or maybe I’m missing something.
My sample code is below.
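(This is a trimmed sketch of what I'm running, using the google-genai SDK against Vertex AI; the project, location, model, and file names are placeholders.)

```python
from google import genai
from google.genai import types

# Placeholders: swap in your own project, location, model, and file name.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

with open("image_1920x1080.png", "rb") as f:
    image_bytes = f.read()

contents = [
    types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
    "describe",
]

# Count tokens up front...
count = client.models.count_tokens(model="gemini-2.0-flash", contents=contents)
print("count_tokens:", count.total_tokens)

# ...and compare with the usage metadata on an actual response.
response = client.models.generate_content(model="gemini-2.0-flash", contents=contents)
print("prompt_token_count:", response.usage_metadata.prompt_token_count)
```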
Whether I count tokens with client.models.count_tokens, read the usage metadata in the response, or try the same prompts in the Vertex AI playground, I get these numbers:
- 1 image 1920x1080 with a short text prompt (“describe”) results in over 1800 input tokens
- 1 image 1280x720 with a short text prompt (“describe”) results in over 1800 input tokens
- 1 image 640x360 with a short text prompt (“describe”) results in over 1800 input tokens
- 1 image 320x180 with a short text prompt (“describe”) results in 270 input tokens
- 4 images (1920x1080 each) with a short text prompt results in a little over 1000 input tokens
The 270 tokens for the 320x180 image match the documented behavior for an image under 384 pixels on each side (presumably 258 for the image plus ~12 for the text prompt), but the rest do not show a lower token count as the resolution is reduced.
For example, I would expect 640x360 to fit in a single 768x768 tile and therefore count as 258 tokens, but that is not the case.
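To make my expectation concrete, here is the rule as I read it; note the tile-count formula (a ceiling on each dimension) is my own interpretation of "cropped and scaled as needed into tiles of 768x768 pixels", not something the docs spell out:

```python
import math

def expected_image_tokens(width: int, height: int) -> int:
    # Documented rule as I read it; the tiling formula is my assumption.
    if width <= 384 and height <= 384:
        return 258  # small images: flat 258 tokens
    # Larger images: cropped/scaled into 768x768 tiles, 258 tokens each.
    tiles = math.ceil(width / 768) * math.ceil(height / 768)
    return tiles * 258

for w, h in [(1920, 1080), (1280, 720), (640, 360), (320, 180)]:
    print(f"{w}x{h}: expecting {expected_image_tokens(w, h)} image tokens")
```

Under this reading I would expect 1548 tokens for 1920x1080 (6 tiles), 516 for 1280x720 (2 tiles), and 258 for the two smaller images, yet the three larger images all measure over 1800.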
Also, how come sending 4 images results in around 1000 tokens? It seems that in this case each image is downscaled internally and counted as a flat 258 tokens (4 × 258 = 1032, which would match "a little over 1000") rather than being tiled at its original resolution.
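For completeness, the 4-image request is just the same pattern as the snippet above with four image parts (again with placeholder file names):

```python
# Four 1920x1080 frames plus the same short prompt.
contents = [
    types.Part.from_bytes(data=open(f"frame_{i}.png", "rb").read(), mime_type="image/png")
    for i in range(4)
] + ["describe"]

count = client.models.count_tokens(model="gemini-2.0-flash", contents=contents)
print("count_tokens:", count.total_tokens)
```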
I would love some help understanding what's going on here.
Thank you!