A couple of silly questions:
Let’s say I am using MultiHeadAttention in an encoder-decoder architecture. In that case, the decoder generates a single output token at a time. If that is correct, why does the documentation say this:
attention_output: The result of the computation, of shape (B, T, E), where T is for target sequence shapes and E is the query input last dimension.
Does that mean T == 1 during single-token inference?
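To make the shape question concrete, here is my understanding of plain single-head attention in NumPy (a toy sketch, not the Keras internals): the output has one row per query position, so T is just whatever query length you pass in — the full target sequence during training, or 1 if you feed a single token at inference.

```python
import numpy as np

B, T, S, E = 2, 5, 7, 16  # batch, target (query) len, source len, embed dim
rng = np.random.default_rng(0)
q = rng.standard_normal((B, T, E))  # decoder queries
k = rng.standard_normal((B, S, E))  # encoder keys
v = rng.standard_normal((B, S, E))  # encoder values

scores = q @ k.transpose(0, 2, 1) / np.sqrt(E)    # (B, T, S)
scores -= scores.max(-1, keepdims=True)           # stabilize softmax
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)         # (B, T, S)
out = weights @ v                                 # (B, T, E)
print(out.shape)  # (2, 5, 16): one output row per target position
```

So with T = 1 the same code would give a (2, 1, 16) output, which matches my one-token-at-a-time picture.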
Also, how does use_causal_mask work? During inference, in order for TF to know what to mask, doesn’t it need to know the current time step (how many output tokens have already been generated) so it can mask the rest? Where does it get that information?
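My current guess (please correct me if this is wrong) is that the mask is derived purely from the query’s sequence length at call time, as a lower-triangular matrix, so no separate step counter is needed — something like:

```python
import numpy as np

def causal_mask(t: int) -> np.ndarray:
    # Lower-triangular boolean mask: position i may only attend to j <= i.
    # t is read straight off the query tensor's sequence dimension at call
    # time, so there is no separate "current time-step" to keep track of.
    return np.tril(np.ones((t, t), dtype=bool))

print(causal_mask(3).astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

If you only pass the tokens generated so far, the last row of this mask already attends to everything available, and with t == 1 the mask is trivially all-ones.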