In your technical report you state that mixing general-purpose data during the ‘Vision Encoder Enhancement’ and ‘Multimodal Decoder Pretraining’ stages was crucial ‘to preserve the visual-language reasoning capabilities’ and to ensure MedGemma retained ‘strong general-purpose capabilities’. I’m curious whether, during development, you explored or considered scenarios where the models were trained exclusively on medical data. If so, what were the observed or anticipated trade-offs?
Specifically, would such an approach lead to a broader degradation of ‘overall reasoning capabilities,’ or would the impact be more nuanced, primarily affecting the model’s ability to reason about ‘general’ visual-language tasks while potentially enhancing ‘specialized’ medical reasoning?
Training exclusively on medical data, particularly when the medical dataset is smaller than the general corpus, often leads to overfitting rather than a significant improvement in medical performance. Additionally, there are use cases even in medicine where general performance is important, e.g., understanding natural images or other context that is not intrinsically medical but may be part of a broader health journey.
Training only on medical data can improve specialized medical reasoning, but it may reduce general visual-language capabilities. Mixing general-purpose data helps the model retain broader reasoning skills, balancing performance across both general and medical-specific tasks.
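As a rough illustration of what "mixing" can mean in practice, here is a minimal sketch of a batch sampler that draws each example from a medical pool or a general pool according to a fixed ratio. The function name `mixed_batches`, the 30% `medical_fraction`, and the toy string examples are all hypothetical and chosen for illustration; they are not taken from the MedGemma report, which may use a different mixing scheme.

```python
import random


def mixed_batches(medical, general, medical_fraction=0.3,
                  batch_size=8, num_batches=100, seed=0):
    """Yield batches that mix two data pools.

    Each example is drawn from `medical` with probability
    `medical_fraction`, otherwise from `general`. The ratio is a
    hypothetical knob, not a value from the MedGemma report.
    """
    rng = random.Random(seed)  # local RNG so runs are reproducible
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            pool = medical if rng.random() < medical_fraction else general
            batch.append(rng.choice(pool))  # sample with replacement
        yield batch


if __name__ == "__main__":
    # Toy stand-ins for real training examples.
    medical = ["chest_xray_report", "derm_image_qa", "pathology_patch"]
    general = ["natural_image_caption", "chart_qa", "ocr_doc", "vqa"]
    for batch in mixed_batches(medical, general, num_batches=2):
        print(batch)
```

Setting `medical_fraction=1.0` corresponds to the medical-only scenario asked about above, so the same sampler could be used to compare the two regimes on held-out general and medical benchmarks.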
Thank you for the clear response to this question. Do you have a sense of whether it might be worthwhile to train a specialist model to optimize performance on a particular vision-related task and use it alongside a more general-purpose model that handles other tasks? I know there has been a lot of work at Google on general-purpose models (e.g., AMIE) and their evaluation, but has this been (and, more importantly, is it still being) explored? That is, would a task-specialized model outperform a general-purpose model whose broad training may also improve its performance on the task of interest?