Question on image resolution scaling

Hi, in your technical report you note that the MedSigLIP model, while based on an 896x896 resolution encoder, is released at 448x448 by sharing the same model weights and using down-sampled positional embeddings. Could you elaborate on any challenges you faced when determining this scaling approach, or any limitations you observed with this method compared to, say, training a model natively at 448x448 from scratch without shared weights?

The vision encoder was primarily designed to support the MedGemma model, which operates at 896x896 resolution. Its weights can be found inside the MedGemma checkpoints.

This 448x448 release is primarily intended to make the encoder easier to fine-tune for vision tasks on fewer accelerator chips.

Training a separate model natively at 448x448, without shared weights, would likely work as well.
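For readers curious what "down-sampled positional embeddings" looks like in practice, here is a minimal sketch of resampling a ViT positional-embedding table to a smaller patch grid via bilinear interpolation. The function name, patch grid (56x56 to 28x28, assuming a hypothetical patch size of 16) and embedding width (1152) are illustrative assumptions, not the actual MedSigLIP values:

```python
import numpy as np
from scipy.ndimage import zoom

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Bilinearly resample a flattened (old_grid**2, dim) positional-embedding
    table to (new_grid**2, dim) so the same encoder weights can be reused
    at a different input resolution."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(old_grid, old_grid, dim)
    factor = new_grid / old_grid
    # order=1 -> bilinear interpolation over the spatial grid only
    resized = zoom(grid, (factor, factor, 1), order=1)
    return resized.reshape(new_grid * new_grid, dim)

# Illustrative sizes: a 56x56 patch grid at 896px becomes 28x28 at 448px.
pe_896 = np.random.randn(56 * 56, 1152).astype(np.float32)
pe_448 = resize_pos_embed(pe_896, 56, 28)
print(pe_448.shape)  # (784, 1152)
```

Because only the positional-embedding table depends on the patch-grid size, all other transformer weights can be shared unchanged between the two resolutions.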