In your technical report, you state that mixing general-purpose data into the ‘Vision Encoder Enhancement’ and ‘Multimodal Decoder Pretraining’ stages was crucial ‘to preserve the visual-language reasoning capabilities’ and to ensure MedGemma retained ‘strong general-purpose capabilities’. I am curious whether, during development, you explored or considered training the models exclusively on medical data. If so, what were the observed or anticipated trade-offs?
Specifically, would such an approach lead to a broader degradation of ‘overall reasoning capabilities’, or would the impact be more nuanced, primarily affecting the model’s ability to reason about ‘general’ visual-language tasks while potentially enhancing ‘specialized’ medical reasoning?