A couple of silly questions:
Let’s say I am using MultiHeadAttention in an encoder-decoder architecture. In that case, the decoder generates a single output token at a time. If that is correct, why does the documentation say this:
attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension.
Does that mean T == 1 during single-token inference?
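To make the shape question concrete, here is my understanding of plain single-head attention in NumPy (a toy sketch, not the Keras internals): the output has one row per query position, so T is just whatever query length you pass in — the full target sequence during training, or 1 if you feed a single token at inference.

```python
import numpy as np

B, T, S, E = 2, 5, 7, 16  # batch, target (query) len, source len, embed dim
rng = np.random.default_rng(0)
q = rng.standard_normal((B, T, E))  # decoder queries
k = rng.standard_normal((B, S, E))  # encoder keys
v = rng.standard_normal((B, S, E))  # encoder values

scores = q @ k.transpose(0, 2, 1) / np.sqrt(E)    # (B, T, S)
scores -= scores.max(-1, keepdims=True)           # stabilize softmax
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)         # (B, T, S)
out = weights @ v                                 # (B, T, E)
print(out.shape)  # (2, 5, 16): one output row per target position
```

So with T = 1 the same code would give a (2, 1, 16) output, which matches my one-token-at-a-time picture.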
Also, how does use_causal_mask work? During inference, in order for TF to know what to mask, doesn’t it need to know the current time step (how many output tokens have already been generated) so it can mask the rest? Where does it get that information?
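My current guess (please correct me if this is wrong) is that the mask is derived purely from the query’s sequence length at call time, as a lower-triangular matrix, so no separate step counter is needed — something like:

```python
import numpy as np

def causal_mask(t: int) -> np.ndarray:
    # Lower-triangular boolean mask: position i may only attend to j <= i.
    # t is read straight off the query tensor's sequence dimension at call
    # time, so there is no separate "current time-step" to keep track of.
    return np.tril(np.ones((t, t), dtype=bool))

print(causal_mask(3).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

If you only pass the tokens generated so far, the last row of this mask already attends to everything available, and with t == 1 the mask is trivially all-ones.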