I have been looking at the difference between autoregressive and non-autoregressive transformer architectures, but I am wondering: is the attention layer in TensorFlow actually autoregressive, or do I need to implement the autoregressive (causal masking) mechanism myself?
I don't see any option for a causal mask (e.g. causal=True/False), and it isn't clear whether "tfa.layers.MultiHeadAttention" is autoregressive or not.
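For context, this is a rough sketch of what I mean by adding the causal behaviour myself: building a lower-triangular mask and passing it via the attention_mask argument of Keras's tf.keras.layers.MultiHeadAttention (I'm assuming the same idea would apply to the Addons layer; the shapes and layer settings here are just placeholders).

```python
import tensorflow as tf

seq_len = 8
d_model = 16

# Dummy batch of embeddings: (batch, seq_len, d_model)
x = tf.random.normal((2, seq_len, d_model))

# Lower-triangular (causal) mask: position i may only attend to positions <= i
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
causal_mask = tf.cast(causal_mask, tf.bool)[tf.newaxis, ...]  # (1, seq_len, seq_len), broadcast over batch

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)
out = mha(query=x, value=x, key=x, attention_mask=causal_mask)
print(out.shape)  # (2, 8, 16)
```

Is something like this necessary, or does the layer already handle it?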
Any thoughts on that would be appreciated.