Hello, I'm interested in your work, and I have a question about something the paper does not explain.
I noticed that the causal attention in the decoder is structured differently from that of a normal transformer:
- MAT's causal attention uses the encoder output as 'Key' and the decoder self-attention output as 'Query' and 'Value', while a normal transformer's causal attention uses the encoder output as 'Query' and 'Value' and the decoder self-attention output as 'Key'.
- MAT's residual connection after the causal attention adds the encoder output, while a normal transformer's residual connection adds the decoder self-attention output.
I have circled this in the figure. Is there any reason for changing the structure like this?
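
To make the two wirings concrete, here is a minimal NumPy sketch of how I read them. The names (`attention`, `enc_out`, `dec_self`) and the sizes are my own placeholders, not taken from the MAT code, and layer normalization, masking, and multiple heads are left out. It only mirrors my description in the two bullet points, so it may not match the actual implementation.

```python
import numpy as np


def attention(query, key, value):
    """Single-head scaled dot-product attention, no masking (simplified)."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    # Subtract the row-wise max for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value


rng = np.random.default_rng(0)
n, d = 4, 8  # hypothetical sizes: n tokens (agents), d features

enc_out = rng.normal(size=(n, d))   # stand-in for the encoder output
dec_self = rng.normal(size=(n, d))  # stand-in for the decoder self-attention output

# Wiring as I describe it for MAT:
# Key = encoder output, Query and Value = decoder self-attention output,
# residual taken from the encoder output.
mat_out = enc_out + attention(query=dec_self, key=enc_out, value=dec_self)

# Wiring as I describe it for a normal transformer:
# Query and Value = encoder output, Key = decoder self-attention output,
# residual taken from the decoder self-attention output.
ref_out = dec_self + attention(query=enc_out, key=dec_self, value=enc_out)

print(mat_out.shape, ref_out.shape)
```

Both variants produce an output of the same shape here because the encoder and decoder sequences have the same length in this sketch; the difference lies only in which tensor plays which role and which one is carried by the residual.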
