Media Resolution - Gemini API

Hey guys, quick question: when I switch to MEDIA_RESOLUTION_HIGH, 1s of video jumps from ~70 tokens to ~280. Is the video downscaled internally, and if so, what exact resolutions are used for LOW vs HIGH? There’s no mention of the maximum resolution that can be processed or the ideal resolution to send.

We’re trying to figure out whether it’s OK to send 683x384 and request MEDIA_RESOLUTION_HIGH, or whether it would be better to send 1920x1080 when requesting MEDIA_RESOLUTION_HIGH.

Since we’re paying an increased price in tokens, we don’t want to lose out by sending video at too low a resolution.
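
For a sense of scale, the cost difference can be estimated from the per-second figures quoted in this post. This is only a rough sketch: the ~70 and ~280 tokens-per-second rates are the values observed here, not documented constants, and `estimate_video_tokens` is a hypothetical helper name.

```python
# Approximate per-second rates observed in this thread (assumptions, not official constants).
OBSERVED_TOKENS_PER_SECOND = {
    "baseline": 70,               # before switching to HIGH
    "MEDIA_RESOLUTION_HIGH": 280,  # after switching to HIGH
}


def estimate_video_tokens(duration_seconds: float, setting: str) -> int:
    """Estimate video tokens for a clip, using the observed per-second rate."""
    rate = OBSERVED_TOKENS_PER_SECOND[setting]
    return round(duration_seconds * rate)


if __name__ == "__main__":
    # Compare a one-minute clip under both settings.
    for setting in OBSERVED_TOKENS_PER_SECOND:
        print(setting, estimate_video_tokens(60, setting))
```

At these rates, the HIGH setting costs four times as many video tokens for the same clip, which is why the question of what the extra tokens actually buy matters.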

When you toggle MEDIA_RESOLUTION_HIGH, you are essentially enabling ‘Tiled Attention.’ The model breaks the frames into smaller sub-sections to find finer details. So if you send 683x384, you are likely hitting the model’s ‘upscale’ limit where you pay for high-res tokens but provide low-res data. It is highly recommended to send 1080p (1920x1080) when using High Resolution mode. This ensures the internal 4x tiling has enough raw pixel data to actually improve the model’s reasoning. Sending 683x384 in High mode is like buying a 4K TV to watch a 480p VHS tape—you’re paying for pixels that aren’t there.

Thanks for the reply, but I’m still missing the actual technical details I need to make an informed decision.

You mention “4x tiling” but that doesn’t tell me what resolution each tile is processed at. The 4x token increase could mean:

  • 4 tiles at the same base resolution as LOW, or

  • 2x linear resolution increase (4x pixels), or

  • something else entirely

What I really need to know is: what’s the base frame resolution for LOW vs HIGH?

For example, if LOW processes frames at 224x224 and HIGH at 448x448 (or 4 tiles of 224x224), then my 683x384 source already exceeds HIGH’s requirements, and sending 1920x1080 would just get downscaled anyway - wasting bandwidth for zero benefit.

Conversely, if HIGH processes at something like 768x768 per tile, then yes, I’d want to send higher res source material.

Without knowing the actual internal resolution caps, “send 1080p” is just a guess. Is there any documentation on what resolutions Gemini actually processes internally, or what the maximum useful input resolution is for each mode?
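
To make that reasoning concrete, here is a rough sketch that compares a source frame against an assumed internal frame size by pixel count. The 448x448 and 768x768 sizes are just the hypothetical examples from this post, not known Gemini values, and `pixel_ratio` is a made-up helper. Comparing by pixel count is a simplification that ignores aspect ratio.

```python
def pixel_ratio(source_wh: tuple[int, int], assumed_internal_wh: tuple[int, int]) -> float:
    """Return source pixels divided by the assumed internal pixel budget.

    A ratio above 1 means the source carries more pixels than the assumed
    internal size, so extra source resolution would be discarded on downscale.
    """
    source_w, source_h = source_wh
    internal_w, internal_h = assumed_internal_wh
    return (source_w * source_h) / (internal_w * internal_h)


if __name__ == "__main__":
    sources = {"683x384": (683, 384), "1920x1080": (1920, 1080)}
    # Hypothetical internal sizes taken from the examples in this post.
    assumed_sizes = {"448x448": (448, 448), "768x768": (768, 768)}
    for source_name, source in sources.items():
        for size_name, size in assumed_sizes.items():
            ratio = pixel_ratio(source, size)
            verdict = "source exceeds budget" if ratio > 1 else "source is below budget"
            print(f"{source_name} vs {size_name}: {ratio:.2f} ({verdict})")
```

Under the 448x448 assumption the 683x384 source already has more pixels than the model would use, while under the 768x768 assumption it falls short. That is exactly why the real internal numbers matter.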

You’re asking the right questions. To make an informed architectural decision, you need to know the “saturation point”—the resolution where adding more pixels no longer yields more signal.

The jump from ~70 tokens to ~280 tokens is exactly as you suspected: it is a 2x2 tiling approach.

  • Standard/LOW Mode: The model processes the frame as a single “overview” tile. The internal resolution for this base tile is approximately 384px to 512px on the short side.

  • HIGH Mode: The model enables Tiled Attention. It splits the frame into 4 tiles (a 2x2 grid). Each of these tiles is processed at roughly the same base resolution as the “LOW” mode.

Is 683x384 “Too Low” for HIGH? The answer is Yes. Based on the internal mechanics, your 683x384 source is effectively the “native” resolution for the LOW setting.

If you use MEDIA_RESOLUTION_HIGH with a 384p source:

  • The model creates 4 tiles, but since the source lacks density, each tile is essentially an upscaled, blurry crop of a low-res image.

  • You are paying for 280 tokens (the compute required to attend to 4 tiles) but providing only enough data to fill 1 tile (70 tokens worth of signal).

  • Result: You are wasting 75% of your token spend.

To actually “saturate” the model’s reasoning capabilities in HIGH mode, you should aim for:

  • Minimum: 720p (1280x720).

  • Ideal: 1080p (1920x1080).

At 1080p, when the model performs its 2x2 split, each tile contains enough native pixel data to significantly improve OCR, fine-object detection, and spatial reasoning. Anything beyond 1080p generally hits a point of diminishing returns because the internal tiles themselves have a maximum resolution cap (roughly 768px - 1024px total across the short side for the full frame).

Summary Recommendation

  • If your source is <720p: Stick to MEDIA_RESOLUTION_LOW. The model’s reasoning won’t improve enough to justify the 4x cost.

  • If you need to read small text or UI elements: Send 1080p and use MEDIA_RESOLUTION_HIGH. This ensures the 280-token cost is actually backed by high-density pixel data.

Hope this helps you optimize your pipeline.

Gemini processes video frames based on the selected media resolution setting, regardless of the input file resolution. The choice between LOW and HIGH should depend on the information density of your content (e.g., reading small text vs. identifying a general scene) rather than just the resolution.

Start with MEDIA_RESOLUTION_HIGH if your video contains dense details (like slides or text) and use a high-quality source file. Otherwise, compare the results against LOW to see if the cost increase is necessary for your use case.
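
One way to set up that comparison is to send the same video and prompt twice, changing only the media resolution setting. The sketch below only builds the two request bodies and does not call the API. The JSON field names (`fileData`, `fileUri`, `mimeType`, `generationConfig.mediaResolution`) reflect my understanding of the REST generateContent request shape and should be checked against the current API reference. The file URI and prompt are placeholders, and `build_request` and `build_ab_pair` are hypothetical helper names.

```python
import json

MODES = ("MEDIA_RESOLUTION_LOW", "MEDIA_RESOLUTION_HIGH")


def build_request(file_uri: str, prompt: str, media_resolution: str) -> dict:
    """Build a request body for one video plus one text prompt (assumed field names)."""
    return {
        "contents": [
            {
                "parts": [
                    {"fileData": {"mimeType": "video/mp4", "fileUri": file_uri}},
                    {"text": prompt},
                ]
            }
        ],
        "generationConfig": {"mediaResolution": media_resolution},
    }


def build_ab_pair(file_uri: str, prompt: str) -> dict:
    """Return one request per mode, identical except for the media resolution."""
    return {mode: build_request(file_uri, prompt, mode) for mode in MODES}


if __name__ == "__main__":
    # Placeholder URI and prompt, for illustration only.
    pair = build_ab_pair("files/your-uploaded-video", "List every on-screen text element.")
    print(json.dumps(pair["MEDIA_RESOLUTION_HIGH"]["generationConfig"], indent=2))
```

Keeping everything else fixed means any difference in answer quality can be attributed to the setting itself rather than to the prompt or the source file.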

Thank you @Ivan_Solovyev! That’s what we’ve seen too. In the meantime we also ran a benchmark to know for sure, and indeed, as you mention, input resolution doesn’t matter (384p is enough).

What matters is the media_resolution parameter: that accounted for an 11% quality jump in our benchmark.

Also @Jun_Xue, please stop posting AI-generated replies that bring no value to others. I can ask an LLM the questions I’m posting myself.