Hey! Enjoyed reading your paper today, and congrats on such amazing work. I was wondering if you could help me clear up some confusion I have between Sections 3 and 4 of your paper on the one hand, and your GitHub implementation and the existing PyTorch implementation on the other. It seems to me that your GitHub code and the official PyTorch implementation both operate on a given input agnostic of what produced it, allowing the developer to place the RMSNorm layer after a Linear layer or after a Linear layer + ReLU. However, in the paper the normalization seems to happen "within" a Linear layer: "... denotes the weight-summed inputs to neurons, which is also the target of normalization." Any chance you could clear that up for me? I suppose one way to reconcile the two is to consider what happens in Section 3 when the weight matrix is the identity and the bias vector is zero.
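To make that identity-matrix thought experiment concrete, here is a minimal pure-Python sketch (my own, not taken from your repo or from PyTorch) checking that normalizing the weight-summed inputs a = Wx + b, as in the paper, coincides with applying RMSNorm directly to x, as in the implementations, when W = I and b = 0:

```python
import math

def rms_norm(x, eps=1e-8):
    # RMSNorm without a learnable gain: x_i / RMS(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W, b):
    # Plain linear layer: a_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(W))]

x = [1.0, 2.0, 3.0]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

# Paper's view: normalize the weight-summed inputs a = Wx + b
paper_view = rms_norm(linear(x, identity, zero_bias))

# Implementations' view: apply RMSNorm directly to the input x
impl_view = rms_norm(x)

assert paper_view == impl_view  # identical when W = I and b = 0
```

So the two views agree in that special case; with a general W the paper's "within the layer" formulation is just RMSNorm applied to the layer's pre-activations, which is exactly what placing a standalone RMSNorm module after a Linear layer computes.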