I am going through the code of the Transformer model - [here] .
I noticed that in the call method of the Decoder class the input embedding is multiplied by the square root of d_model. There is no explanation given for this step. Can someone please explain why this is done in the Decoder class?
Hi @Abhishek_Kishore,
Sorry for the delay in response.
As far as I'm aware, the embeddings are multiplied by x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
so that the token embeddings and the positional encodings are on a comparable scale before they are added together. Embedding weights are typically initialized with small values (variance on the order of 1/d_model), so scaling by sqrt(d_model) brings the embedding vectors up to roughly unit variance; without it, the positional encoding (whose values lie in [-1, 1]) would dominate the token information. Note that this is a separate step from the division by sqrt(d_k) inside scaled dot-product attention, which keeps the query-key dot products from growing too large with increasing dimension and saturating the softmax, where gradients become near zero.
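To make the scale argument concrete, here is a small NumPy sketch (not the tutorial's code; the initialization variance and sequence length are illustrative assumptions) comparing the magnitude of freshly initialized embeddings against sinusoidal positional encodings, before and after multiplying by sqrt(d_model):

```python
import numpy as np

d_model = 512
seq_len = 4  # illustrative sequence length
rng = np.random.default_rng(0)

# Embedding weights are often initialized with variance ~ 1/d_model,
# so per-element values are tiny (std ~ 1/sqrt(d_model) ~ 0.044 here).
emb = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(seq_len, d_model))

# Standard sinusoidal positional encodings: values lie in [-1, 1].
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

print("raw embedding std:   ", emb.std())                        # tiny vs. pe
print("scaled embedding std:", (emb * np.sqrt(d_model)).std())   # ~1, comparable to pe
print("positional enc. std: ", pe.std())
```

After scaling, the embedding values have roughly unit standard deviation, so adding the positional encoding perturbs rather than overwhelms the token representation.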
Kindly refer to this transformers tutorial for more details.
Hope this helps. Thank you.