create_padding_mask in the Transformer code uses the encoder input sequence to build the padding mask for the 2nd attention block of the decoder

I am going through the Transformer code on tensorflow.org.

def create_masks(self, inp, tar):
    # Encoder padding mask (Used in the 2nd attention block in the decoder too.)
    padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return padding_mask, look_ahead_mask
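
For context, the helper functions referenced above are defined roughly like this in the tutorial (copied from memory, so treat it as a sketch rather than the exact source):

import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 wherever the token id is 0 (the padding id used in the tutorial),
    # shaped so it broadcasts over (batch, num_heads, query_len, key_len).
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper-triangular matrix of 1s: position i may not attend to positions > i.
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)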

The Transformer class has a method called create_masks which creates the padding and look-ahead masks. I understand that the encoder's padding mask should be created from the input sequence (the input to the encoder). What I do not understand is why that same encoder input sequence is also used to create the padding mask for the second attention block of the decoder (the first line of the code above). I would have thought the decoder's padding mask should be created from the target sequence (which is fed to the decoder).

Please help me understand why this is done.


Hi @Abhishek_Kishore,

Sorry for the delay in response.

Yes, the padding mask in the second attention block is built from the encoder's input sequence because, in that block, the decoder attends to the encoder's output. If the encoder's input contained padding tokens, the decoder needs to ignore the corresponding positions of the encoder's output, so the encoder's padding mask is reused there. In other words, in cross-attention the queries come from the decoder but the keys and values come from the encoder output, so the mask has to cover the encoder's positions. The target sequence's padding mask is only needed in the first attention block, where the decoder attends to its own previous tokens.
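
To make the shape argument concrete, here is a small illustration with made-up numbers (the token ids, the padding id 0, and the number of heads are just assumptions for the example):

import tensorflow as tf

# Toy batch: encoder input padded to length 5, decoder target padded to length 4.
inp = tf.constant([[7, 6, 4, 0, 0]])   # (1, 5) -- last two positions are padding
tar = tf.constant([[1, 8, 3, 0]])      # (1, 4)

# Same construction as create_padding_mask(inp): 1.0 where inp == 0.
padding_mask = tf.cast(tf.math.equal(inp, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # (1, 1, 1, 5)

# In the decoder's 2nd attention block, the queries come from the decoder
# (length 4) while the keys/values come from the encoder output (length 5),
# so the attention logits have shape (batch, heads, tar_len, inp_len).
num_heads = 8
logits = tf.random.uniform((1, num_heads, 4, 5))   # stand-in for q @ k^T / sqrt(d_k)

# The mask's last axis must match the *encoder* length (5), which is exactly
# what create_padding_mask(inp) provides; a mask built from tar would be
# length 4 and would not broadcast against the key axis at all.
weights = tf.nn.softmax(logits + padding_mask * -1e9, axis=-1)
print(weights[0, 0, 0])  # the two padded encoder positions get ~0 attention weight

A decoder-side padding mask built from tar is still needed, but only in the first (self-attention) block, where it is combined with the look-ahead mask as shown in create_masks above.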

As I understand it, using the encoder's input to create the padding mask for the decoder's second attention block ensures that irrelevant padding tokens do not influence the attention scores, which keeps learning and prediction effective in the Transformer model.

Hope this clears things up. Thank you.