Question about causal attention in decoder #31

@Porthoos

Hello, I'm interested in your work, and I have a question that the paper does not explain.
I noticed that the causal attention in the decoder uses a different structure from that of normal transformers:

  1. MAT's causal attention uses the encoder output as 'Key' and the decoder self-attention output as 'Query' and 'Value', whereas the causal attention in normal transformers uses the encoder output as 'Query' and 'Value' and the decoder self-attention output as 'Key'.
  2. MAT's residual connection adds the encoder output after the causal attention, whereas the residual connection in normal transformers adds the decoder self-attention output.

I have circled this in the figure. Is there a reason for changing the structure like this?
[Image attachment: figure with the decoder's causal attention block circled]
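To make the two wirings in the numbered list concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not MAT's actual implementation: the names `attention`, `add_rows`, `standard_block`, and `variant_block` are hypothetical, and learned projections, multiple heads, and layer normalization are left out. For reference, in the original Transformer's encoder-decoder attention the queries come from the decoder and the keys and values from the encoder, which is what `standard_block` does. `variant_block` follows the MAT wiring exactly as points 1 and 2 describe it and has not been checked against the MAT source code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, key, value, causal=False):
    """Single-head scaled dot-product attention on lists of vectors.

    No learned projections: this only shows which tensor plays which role.
    """
    d = len(key[0])
    out = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in key]
        if causal:
            # Position i may only attend to positions j <= i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, value))
            for c in range(len(value[0]))
        ])
    return out


def add_rows(a, b):
    """Element-wise sum of two equally shaped lists of vectors (the residual)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def standard_block(dec, enc):
    # Original Transformer: Q from the decoder, K and V from the encoder,
    # residual added to the decoder stream.
    return add_rows(dec, attention(query=dec, key=enc, value=enc))


def variant_block(dec, enc):
    # Wiring as described in this issue (assumption, unverified):
    # K from the encoder, Q and V from the decoder, residual from the encoder.
    return add_rows(enc, attention(query=dec, key=enc, value=dec, causal=True))


# Toy inputs: three positions (e.g. agents), two features each.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

standard_out = standard_block(dec, enc)
variant_out = variant_block(dec, enc)
```

Note that the variant only type-checks when the encoder and decoder sequences have the same length, because keys (encoder) and values (decoder) must pair up one-to-one and the residual adds the encoder output to an output indexed by decoder positions.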
