Gemma 3 support for packed sequence training with FlashAttention 2?

Hi everyone,

I have a question about finetuning Gemma 3 models on Hugging Face with sequence packing and FlashAttention 2.

From the recent Hugging Face “packing + FA2” work, it looks like some architectures support training with:

  • Packed sequences (multiple samples concatenated into a single long sequence)

  • Variable‑length attention via cumulative sequence lengths (e.g. cu_seqlens) instead of a classic 0/1 padding mask

  • Proper attention isolation between individual sequences inside the same pack, so that tokens from one sample cannot attend to tokens from another
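For context, the "cumulative sequence lengths" mentioned above are the boundary offsets that FlashAttention's variable-length kernels consume. A minimal sketch of how they relate to the per-sample lengths in a pack (a standalone illustrative helper, not Transformers-internal code):

```python
# Sketch: derive FlashAttention-style cumulative sequence lengths
# (cu_seqlens) from the lengths of the samples packed into one sequence.
# cu_seqlens[i] is the start offset of sample i; the last entry is the
# total token count. Illustrative helper, not an official API.

def cu_seqlens_from_lengths(lengths):
    """Return [0, l0, l0+l1, ...] boundary offsets for varlen attention."""
    cu = [0]
    for n in lengths:
        cu.append(cu[-1] + n)
    return cu

# Three samples of lengths 3, 5, 2 packed into one 10-token sequence:
print(cu_seqlens_from_lengths([3, 5, 2]))  # [0, 3, 8, 10]
```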

I would like to clarify how this applies specifically to Gemma 3 models on Hugging Face:

  1. Do the current Gemma 3 implementations (e.g. gemma-3-... on HF) officially support training with packed sequences when using FlashAttention 2?

  2. If yes, what is the expected interface?

    • Should the trainer/collator provide a standard binary attention_mask of shape [batch_size, seq_len], and the FA2 integration internally derives cumulative sequence lengths?

    • Or is there a supported variant where the model is driven by cumulative sequence lengths / cu_seqlens (e.g. via position_ids or another field) and no binary mask is passed?

  3. Finally, is there any official example or recommended configuration (Trainer/TRL/SFTTrainer + collator) that demonstrates:

    • Gemma 3 + FlashAttention2

    • Sequence packing

    • Correct per‑sequence isolation in a packed batch

I am fine implementing a custom collator (e.g. flattening multiple examples into a single sequence and computing the right metadata), but I would like to align with the intended / supported behavior for Gemma 3 rather than relying on assumptions from other architectures.

Thanks in advance!


Hi @SyrineM

  1. Gemma-family models are officially supported for sequence packing with FlashAttention 2 in Hugging Face Transformers, as long as the model exposes position_ids, which Gemma does.

  2. In a standard Hugging Face Trainer flow, your collator should provide flattened input_ids together with position_ids that reset to 0 at the start of each packed sequence. The model’s FlashAttention‑2 integration uses these position resets to derive cu_seqlens internally, so you do not need to compute or pass cumulative sequence lengths yourself, and a standard binary attention_mask is typically unnecessary.
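As an illustration of that interface, here is a minimal sketch of a packing collator (names are illustrative, not an official Transformers API): it flattens several tokenized examples into one sequence and builds position_ids that restart at 0 at each sample boundary, which is the signal the FA2 integration uses to recover sequence boundaries.

```python
# Sketch: flatten multiple tokenized examples into a single packed
# sequence and emit position_ids that reset to 0 at each sample start.
# An FA2 integration that supports packing can recover per-sequence
# boundaries (and hence cu_seqlens) from these resets.
# Illustrative code only, not the official Transformers collator.

def pack_examples(examples):
    """examples: list of dicts, each with an 'input_ids' list."""
    input_ids, position_ids = [], []
    for ex in examples:
        ids = ex["input_ids"]
        input_ids.extend(ids)
        position_ids.extend(range(len(ids)))  # restart at 0 per sample
    return {"input_ids": input_ids, "position_ids": position_ids}

batch = pack_examples([
    {"input_ids": [5, 6, 7]},
    {"input_ids": [8, 9]},
])
print(batch["position_ids"])  # [0, 1, 2, 0, 1]
```

In real training you would return tensors and batch multiple packs, but the key point is the position_ids reset pattern rather than a 0/1 attention_mask.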

  3. Please check out these references:
    Improving Hugging Face Training Efficiency Through Packing with Flash Attention 2
    https://ai.google.dev/gemma/docs/core/huggingface_text_full_finetune

    Thanks