MedGemma finetuning - padding and labels' masking

Hi, I’ve got a few questions about the MedGemma finetuning Colab:

1- In the collate function, when creating the labels tensor, only the image tokens are masked, not the entire input prompt. This means that instead of training on the completion only, the model is finetuned on both the prompt and the answer. What is the technical reason behind this?
2- In generation mode, the tokenizer’s padding side is set to the left. Why does this change between training and inference, since padding tokens are masked in the attention mask anyway? Should we do the same for validation?

Thank you for your help!

1- The notebook follows the default behavior of the SFTTrainer, where the model is trained on both the prompt and the answer, which is a standard approach for language modeling. You could also decide to train on completions only by using a prompt-completion dataset. Note that as of TRL v0.22.0 there is native support for training vision-language models with both of these approaches, without the need for a custom data collator. You can learn more in the SFT Trainer docs.
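
If you do want completion-only loss with a custom collator, the key change is to also mask the prompt positions in the labels with -100. Here is a minimal, text-only sketch of the two labeling strategies; it ignores image tokens, and the checkpoint name and prompt/completion strings are placeholders, not the notebook's exact code:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the tokenizer/processor used in the notebook.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

prompt = "Describe the finding in this chest X-ray.\n"
completion = "There is a right lower lobe consolidation."

full_ids = tokenizer(prompt + completion, return_tensors="pt")["input_ids"]

# Default SFT-style labels: loss on both prompt and completion
# (what the notebook does, apart from the image-token masking omitted here).
labels_full = full_ids.clone()

# Completion-only labels: mask the prompt positions with -100 so the loss
# is computed on the answer tokens only.
prompt_len = len(tokenizer(prompt)["input_ids"])  # includes BOS; assumes no token merge at the prompt/completion boundary
labels_completion_only = full_ids.clone()
labels_completion_only[:, :prompt_len] = -100
```

With the prompt-completion dataset route mentioned above, the trainer takes care of this masking for you, which is why it is the simpler option.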

2- During inference, the padding side needs to be set to the left because the model isn’t trained to continue generating from pad tokens. During training (including validation that runs during training), the padding side should be set to the right when using the SFTTrainer, due to an observed issue: a potential overflow when training a model in half precision that results in zero loss.
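
In practice this just means flipping the tokenizer’s `padding_side` depending on the mode. A minimal sketch (the checkpoint name is a placeholder for the one used in the notebook):

```python
from transformers import AutoProcessor

# Placeholder checkpoint; substitute the MedGemma checkpoint from the notebook.
processor = AutoProcessor.from_pretrained("google/medgemma-4b-it")

# Training, and validation that runs during training with the SFTTrainer:
# right padding, to avoid the half-precision overflow issue mentioned above.
processor.tokenizer.padding_side = "right"

# Batched generation: left padding, so every sequence ends with real tokens and
# the model continues from them rather than from pad tokens (which are masked in
# attention but would still occupy the final positions with right padding).
processor.tokenizer.padding_side = "left"
```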

Hope this helps!