I am going through the code of the Transformer model - [here] .
I noticed that in the call method of the Decoder class the input embedding is multiplied by the square root of d_model. There is no explanation given for this step. Can someone please explain why this is done in the Decoder class?
Hi @Abhishek_Kishore,
Sorry for the delay in response.
As far as I'm aware, the embeddings are multiplied by x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
so that the token embeddings and the positional encodings are on a comparable scale before they are added together. Embedding weights are typically initialized with small values (variance on the order of 1/d_model), so scaling by sqrt(d_model) brings the embedding vectors up to roughly unit variance; without it, the positional encoding (whose values lie in [-1, 1]) would dominate the token information. Note that this is a separate step from the division by sqrt(d_k) inside scaled dot-product attention, which keeps the query-key dot products from growing too large with increasing dimension and saturating the softmax, where gradients become near zero.
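To make the scale argument concrete, here is a small NumPy sketch (not the tutorial's code; the initialization variance and sequence length are illustrative assumptions) comparing the magnitude of freshly initialized embeddings against sinusoidal positional encodings, before and after multiplying by sqrt(d_model):

```python
import numpy as np

d_model = 512
seq_len = 4  # illustrative sequence length
rng = np.random.default_rng(0)

# Embedding weights are often initialized with variance ~ 1/d_model,
# so per-element values are tiny (std ~ 1/sqrt(d_model) ~ 0.044 here).
emb = rng.normal(0.0, 1.0 / np.sqrt(d_model), size=(seq_len, d_model))

# Standard sinusoidal positional encodings: values lie in [-1, 1].
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

print("raw embedding std:   ", emb.std())                        # tiny vs. pe
print("scaled embedding std:", (emb * np.sqrt(d_model)).std())   # ~1, comparable to pe
print("positional enc. std: ", pe.std())
```

After scaling, the embedding values have roughly unit standard deviation, so adding the positional encoding perturbs rather than overwhelms the token representation.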
Kindly refer to this transformers tutorial for more details.
Hope this helps. Thank you.