Question about causal attention in decoder

Hello, I'm interested in your work, and I have a question that the paper does not explain.
I noticed that the causal attention in the decoder is structured differently from that of a standard transformer:

1. MAT's causal attention uses the encoder output as 'Key' and the decoder self-attention output as 'Query' and 'Value', while a standard transformer's encoder-decoder attention uses the decoder self-attention output as 'Query' and the encoder output as 'Key' and 'Value'.
2. MAT adds the encoder output as the residual after the causal attention, while a standard transformer adds the decoder self-attention output as the residual.
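
To make the comparison concrete, here is a minimal sketch of the two wirings as I understand them. It is only my reading of the figure, not code taken from the repository. The function names (`attention`, `standard_block`, `mat_block_as_described`) are made up for illustration, and learned projections, multiple heads, layer norm, and the MLP are all left out so that only the Query/Key/Value routing and the residual differ. I also assume the second attention in MAT is masked, since the figure labels it as causal.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, key, value, causal=False):
    # Single-head scaled dot-product attention on lists of vectors.
    # No learned projections: this only illustrates the routing.
    d = len(query[0])
    out = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        w = softmax(scores)
        out.append([sum(w[j] * value[j][c] for j in range(len(value)))
                    for c in range(len(value[0]))])
    return out

def add(a, b):
    # Element-wise residual addition of two sequences of vectors.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def standard_block(dec_in, enc_out):
    # Standard transformer decoder wiring.
    h = add(dec_in, attention(dec_in, dec_in, dec_in, causal=True))
    # Query from the decoder, Key and Value from the encoder;
    # the residual is the decoder self-attention output h.
    return add(h, attention(query=h, key=enc_out, value=enc_out))

def mat_block_as_described(dec_in, enc_out):
    # Wiring as described in points 1 and 2 above (assumption, unverified).
    h = add(dec_in, attention(dec_in, dec_in, dec_in, causal=True))
    # Key from the encoder, Query and Value from the decoder;
    # the residual is the encoder output.
    return add(enc_out, attention(query=h, key=enc_out, value=h, causal=True))

if __name__ == "__main__":
    dec = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy decoder inputs, 3 agents
    enc = [[0.5, -0.5], [0.2, 0.3], [-1.0, 0.4]]  # toy encoder outputs
    print(standard_block(dec, enc))
    print(mat_block_as_described(dec, enc))
```

Note that the second wiring only works when the encoder and decoder sequences have the same length (one entry per agent), because the Keys come from the encoder while the Values come from the decoder.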

I have circled this in the figure. Is there a reason for changing the structure this way?
![E0XC2$0LN9H(MZ9R(X_Q0$X](https://github.com/PKU-MARL/Multi-Agent-Transformer/assets/74708619/98c6358e-c09c-4bb9-9b13-f8e9d23c99e0)


Question about causal attention in decoder #31