create_padding_mask in the Transformer code uses the encoder input sequence to build the padding mask for the 2nd attention block of the decoder

I am going through the Transformer code on tensorflow.org.

def create_masks(self, inp, tar):
    # Encoder padding mask (Used in the 2nd attention block in the decoder too.)
    padding_mask = create_padding_mask(inp)

    # Used in the 1st attention block in the decoder.
    # It is used to pad and mask future tokens in the input received by
    # the decoder.
    look_ahead_mask = create_look_ahead_mask(tf.shape(tar)[1])
    dec_target_padding_mask = create_padding_mask(tar)
    look_ahead_mask = tf.maximum(dec_target_padding_mask, look_ahead_mask)

    return padding_mask, look_ahead_mask
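
For context, the helper functions referenced above are defined roughly like this in the tutorial (copied from memory, so treat it as a sketch rather than the exact source):

import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 wherever the token id is 0 (the padding id used in the tutorial),
    # shaped so it broadcasts over (batch, num_heads, query_len, key_len).
    seq = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return seq[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len)

def create_look_ahead_mask(size):
    # Upper-triangular matrix of 1s: position i may not attend to positions > i.
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)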

The Transformer class has a method called create_masks which creates the padding and look-ahead masks. I understand that the encoder's padding mask should be created from the input sequence (the input to the encoder). What I do not understand is why that same encoder input sequence is also used to create the padding mask for the second attention block of the decoder (the first line of the code above). I would have thought the decoder's padding mask should be created from the target sequence (which is fed to the decoder).

Please help me understand why this is done.


Hi @Abhishek_Kishore,

Sorry for the delay in response.

Yes, the padding mask in the second attention block is built from the encoder's input sequence because, in that block, the decoder attends to the encoder's output. If the encoder's input contained padding tokens, the decoder needs to ignore the corresponding positions of the encoder's output, so the encoder's padding mask is reused there. In other words, in cross-attention the queries come from the decoder but the keys and values come from the encoder output, so the mask has to cover the encoder's positions. The target sequence's padding mask is only needed in the first attention block, where the decoder attends to its own previous tokens.
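
To make the shape argument concrete, here is a small illustration with made-up numbers (the token ids, the padding id 0, and the number of heads are just assumptions for the example):

import tensorflow as tf

# Toy batch: encoder input padded to length 5, decoder target padded to length 4.
inp = tf.constant([[7, 6, 4, 0, 0]])   # (1, 5) -- last two positions are padding
tar = tf.constant([[1, 8, 3, 0]])      # (1, 4)

# Same construction as create_padding_mask(inp): 1.0 where inp == 0.
padding_mask = tf.cast(tf.math.equal(inp, 0), tf.float32)[:, tf.newaxis, tf.newaxis, :]  # (1, 1, 1, 5)

# In the decoder's 2nd attention block, the queries come from the decoder
# (length 4) while the keys/values come from the encoder output (length 5),
# so the attention logits have shape (batch, heads, tar_len, inp_len).
num_heads = 8
logits = tf.random.uniform((1, num_heads, 4, 5))   # stand-in for q @ k^T / sqrt(d_k)

# The mask's last axis must match the *encoder* length (5), which is exactly
# what create_padding_mask(inp) provides; a mask built from tar would be
# length 4 and would not broadcast against the key axis at all.
weights = tf.nn.softmax(logits + padding_mask * -1e9, axis=-1)
print(weights[0, 0, 0])  # the two padded encoder positions get ~0 attention weight

A decoder-side padding mask built from tar is still needed, but only in the first (self-attention) block, where it is combined with the look-ahead mask as shown in create_masks above.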

As I understand it, using the encoder's input to create the padding mask for the decoder's second attention block ensures that irrelevant padding tokens do not influence the attention scores, which keeps learning and prediction effective in the Transformer model.

Hope this clears things up. Thank you.