Questions from Mars Petcare

  1. We’re particularly interested in how you managed image resolution scaling. Your report notes that the MedSigLIP model, while based on an 896x896 resolution encoder, is released at 448x448 by sharing the same model weights and using down-sampled positional embeddings. Our internal experiments suggest that “naively” scaling image inputs, especially for medical images, which tend to have less inherent scale invariance, can be problematic (we tested mostly scaling from 224x224 up to 448x448). Could you elaborate on any challenges you faced when determining this scaling approach, or any limitations you observed with this method compared to, say, training a model natively at 448x448 from scratch without shared weights?

Response: The MedSigLIP model is released at 448x448 resolution by sharing the same model weights as its 896x896 encoder and down-sampling the positional embeddings to match the smaller number of input patches at the lower resolution. This approach was chosen to make experimentation and adaptation by the community more efficient, as many medical vision tasks were found to perform reasonably well at 448x448 resolution.
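
The sketch below illustrates what "down-sampled positional embeddings" can look like in practice for a ViT-style encoder: the learned positional-embedding table for the 896x896 patch grid is resampled onto the smaller 448x448 grid so the same weights can serve both resolutions. This is an illustrative reconstruction, not the released implementation; the patch size (16), embedding dimension (1152), and the choice of bilinear interpolation are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def downsample_pos_embed(pos_embed: torch.Tensor,
                         src_grid: int, dst_grid: int) -> torch.Tensor:
    """Resample a ViT positional-embedding table from a src_grid x src_grid
    patch grid to a dst_grid x dst_grid grid via bilinear interpolation.

    pos_embed: (1, src_grid * src_grid, dim) learned positional embeddings.
    Returns:   (1, dst_grid * dst_grid, dim).
    """
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, src_grid, src_grid) so we can interpolate spatially.
    grid = pos_embed.reshape(1, src_grid, src_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(dst_grid, dst_grid),
                         mode="bilinear", align_corners=False)
    # Back to (1, dst_grid * dst_grid, dim).
    return grid.permute(0, 2, 3, 1).reshape(1, dst_grid * dst_grid, dim)

# Hypothetical example: a patch size of 16 gives a 56x56 grid at 896x896
# and a 28x28 grid at 448x448 (both values are illustrative assumptions).
pos_896 = torch.randn(1, 56 * 56, 1152)
pos_448 = downsample_pos_embed(pos_896, src_grid=56, dst_grid=28)
```

Because only the positional embeddings are resampled, the patch-embedding and transformer weights stay identical across the two resolutions, which is what allows a single checkpoint to be released for both.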

  2. You’ve articulated that mixing general-purpose data during the ‘Vision Encoder Enhancement’ and ‘Multimodal Decoder Pretraining’ stages was crucial ‘to preserve the visual-language reasoning capabilities’ and ensure MedGemma retained ‘strong general-purpose capabilities’. We are curious whether, during your development process, you explored or considered scenarios where the models were trained exclusively on medical data. If so, what were the observed or anticipated trade-offs? Specifically, would such an approach lead to a broader degradation of ‘overall reasoning capabilities,’ or would the impact be more nuanced, primarily affecting the model’s ability to reason about ‘general’ visual-language tasks while potentially enhancing ‘specialized’ medical reasoning?

Response: During the development of MedGemma, general-purpose data was mixed with medical data in the “Vision Encoder Enhancement” and “Multimodal Decoder Pretraining” stages to preserve visual-language reasoning capabilities and ensure MedGemma retained strong general-purpose capabilities. Our technical report evaluates the trade-offs of specialization by comparing MedGemma’s performance on general, non-medical benchmarks against general models. MedGemma exhibited only minor decreases in performance on these general benchmarks (MMLU Pro, Global MMLU Lite, and MMMU), suggesting that the inclusion of general-purpose data helped maintain these capabilities. A model trained exclusively on medical data would likely see a degradation of overall reasoning capabilities on general visual-language tasks, as the mixing of data was crucial for preserving these broader reasoning skills.
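
For context, mixing the two data sources during pretraining can be expressed as a simple weighted sampler over a medical pool and a general-purpose pool, as in the sketch below. This is purely illustrative: the actual MedGemma mixture ratios, sampling scheme, and dataset contents are not disclosed in the report, and the `medical_fraction` value here is a made-up placeholder.

```python
import random
from itertools import islice

def mixed_stream(medical_examples, general_examples,
                 medical_fraction=0.5, seed=0):
    """Yield training examples drawn from two pools according to a mixing
    fraction. Illustrative only: the real mixture weights are not public.
    """
    rng = random.Random(seed)
    med, gen = iter(medical_examples), iter(general_examples)
    while True:
        source = med if rng.random() < medical_fraction else gen
        try:
            yield next(source)
        except StopIteration:
            return  # Stop when either pool is exhausted.

# Hypothetical usage with placeholder records standing in for real datasets.
medical = [{"image": f"cxr_{i}.png", "text": "finding ..."} for i in range(1000)]
general = [{"image": f"web_{i}.jpg", "text": "caption ..."} for i in range(1000)]
batch = list(islice(mixed_stream(medical, general, medical_fraction=0.7), 8))
```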