Hey! Enjoyed reading your paper today, and congrats on such amazing work. I was wondering if you could help me clear up some confusion I have between Sections 3 and 4 of your paper on the one hand, and your GitHub implementation and the existing PyTorch implementation on the other. It seems to me that your GitHub code and the official PyTorch implementation both operate on a given input agnostic of what produced it, allowing the developer to place the RMSNorm layer after a Linear layer or after a Linear layer + ReLU. However, in the paper the normalization seems to happen "within" a Linear layer: "... denotes the weight-summed inputs to neurons, which is also the target of normalization." Any chance you could clear that up for me? I suppose one way to reconcile the two is to consider what happens in Section 3 when the weight matrix is the identity and the bias vector is zero.
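To make that identity-matrix thought experiment concrete, here is a minimal pure-Python sketch (my own, not taken from your repo or from PyTorch) checking that normalizing the weight-summed inputs a = Wx + b, as in the paper, coincides with applying RMSNorm directly to x, as in the implementations, when W = I and b = 0:

```python
import math

def rms_norm(x, eps=1e-8):
    # RMSNorm without a learnable gain: x_i / RMS(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, W, b):
    # Plain linear layer: a_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i]
            for i in range(len(W))]

x = [1.0, 2.0, 3.0]
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]
zero_bias = [0.0, 0.0, 0.0]

# Paper's view: normalize the weight-summed inputs a = Wx + b
paper_view = rms_norm(linear(x, identity, zero_bias))

# Implementations' view: apply RMSNorm directly to the input x
impl_view = rms_norm(x)

assert paper_view == impl_view  # identical when W = I and b = 0
```

So the two views agree in that special case; with a general W the paper's "within the layer" formulation is just RMSNorm applied to the layer's pre-activations, which is exactly what placing a standalone RMSNorm module after a Linear layer computes.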