RMS Norm Optimization by aris134 · Pull Request #583 · ROCm/TransformerEngine

aris134 · 2026-05-12T12:13:34Z

Description

Fixes # (16527)

RMSNorm falls back to the general kernel implementation on several DeepSeek and Qwen shapes, causing poor performance. These shapes have been registered with the tuned kernel cache, and a performance benchmark for RMSNorm has been added.

Additionally, a fallback warning is printed the first time a tuned config is not found for a requested kernel. For example:

in function getKernel: Falling back to general normalization kernel because no tuned kernel is available for this config. hidden_size=128, wtype=bf16, itype=bf16, otype=bf16, ctype=fp32
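A minimal sketch of how a warn-once fallback message could be implemented, assuming the warning is deduplicated per packed config key. The helper name `warn_fallback_once` and its parameters are hypothetical and are not taken from the PR; only the message text follows the example above.

```cpp
#include <cstdint>
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_set>

// Hypothetical helper (not the PR's code): prints the fallback warning only
// the first time a given config key is seen. Returns true if it printed.
inline bool warn_fallback_once(uint64_t key, const std::string& config_desc) {
  static std::mutex mtx;                       // guards the set across threads
  static std::unordered_set<uint64_t> warned;  // keys already reported

  std::lock_guard<std::mutex> lock(mtx);
  if (!warned.insert(key).second) {
    return false;  // this config was already reported; stay silent
  }
  std::cerr << "Falling back to general normalization kernel because no tuned "
               "kernel is available for this config. "
            << config_desc << std::endl;
  return true;
}
```

Keying the set on the packed config means each missing shape is reported once, so a training loop that hits the same untuned shape on every step does not flood the log.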

E2E TFLOPS/s/GPU for proxy models (Previous -> Current with RMSNorm tuning):

Qwen:
bf16: 369.4 -> 374.7
fp8: 352.1 -> 358.2

DeepSeek:
bf16: 501.4 -> 529.4
fp8: 463.9 -> 511.4

Also added matching tuned configs for LayerNorm.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-05-15T01:24:06Z

+                     prop.multiProcessorCount, zero_centered_gamma, stream);
+  }
+
+  HIP_CHECK(hipStreamSynchronize(stream));


Is synchronization needed before warmup?

Good point. This synchronization is in fact redundant, since the warmup already performs a device-wide sync. Removed in 4256e3c.

ipanfilo · 2026-05-15T01:25:43Z

 #include <typeindex>
 #include <unordered_map>
 #include <vector>
+#include <unordered_set>


nit: move it after unordered_map

Done in 2f9ff47

ipanfilo · 2026-05-15T01:41:04Z

                     bool is_tuned, NVTEScalingMode mode = NVTE_DELAYED_TENSOR_SCALING,
                     bool training = true, bool gamma_in_weight_dtype = false);

+inline DType decode_itype(uint64_t general_key) {


This code is fragile because the encoding could change. At least put comments here and at the encoding block saying that they must match.

Good point. I updated this in d548d54 to make the coupling between encoding/decoding explicit by introducing shared norm_key bit-layout constants and using them in both get_key() and the decode helpers. I also added comments documenting that the layouts must remain in sync, so future changes to the packed key format are less likely to silently diverge.
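As an illustration of the idea described above, here is a minimal sketch in which the encoder and the decode helpers read the same layout constants. The field widths, offsets, enum values and the names `pack_types` and `unpack_type` are all assumptions made for this example; they do not reproduce the real `get_key()` packing.

```cpp
#include <cstdint>

namespace norm_key_sketch {

// Stand-in for the library's DType enum; the numeric values are hypothetical.
enum class DType : uint64_t { kFloat32 = 0, kFloat16 = 1, kBFloat16 = 2, kFloat8E4M3 = 3 };

// Single source of truth for the packed layout (hypothetical widths/offsets).
constexpr unsigned kTypeBits = 5;
constexpr uint64_t kTypeMask = (uint64_t{1} << kTypeBits) - 1;
constexpr unsigned kWTypeShift = 0;
constexpr unsigned kITypeShift = kWTypeShift + kTypeBits;
constexpr unsigned kOTypeShift = kITypeShift + kTypeBits;
constexpr unsigned kCTypeShift = kOTypeShift + kTypeBits;

// Encoder: packs the four dtypes using the shared shifts.
inline uint64_t pack_types(DType w, DType i, DType o, DType c) {
  return (static_cast<uint64_t>(w) << kWTypeShift) |
         (static_cast<uint64_t>(i) << kITypeShift) |
         (static_cast<uint64_t>(o) << kOTypeShift) |
         (static_cast<uint64_t>(c) << kCTypeShift);
}

// Decoder: extracts one field using the same shift and mask as the encoder.
inline DType unpack_type(uint64_t key, unsigned shift) {
  return static_cast<DType>((key >> shift) & kTypeMask);
}

inline DType decode_wtype(uint64_t key) { return unpack_type(key, kWTypeShift); }
inline DType decode_itype(uint64_t key) { return unpack_type(key, kITypeShift); }
inline DType decode_otype(uint64_t key) { return unpack_type(key, kOTypeShift); }
inline DType decode_ctype(uint64_t key) { return unpack_type(key, kCTypeShift); }

}  // namespace norm_key_sketch
```

Because both directions use the same constants, changing a shift or width in one place moves the encoder and the decoders together.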

alextmagro · 2026-05-16T00:54:14Z

 REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 6144, bf16, bf16, bf16, fp32, 1, 1, 4, 16);
 REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 6144, fp32, fp32, bf16, fp32, 1, 1, 4, 16);

+REGISTER_NORM_LAUNCHER(LayerNorm, Forward, tuned, 7168, fp32, fp32, fp32, fp32, 1, 1, 4, 16);


For BWD you have 7 warps set, but here you have 4. Is this optimal?

For BWD, 7 warps is indeed more performant than 4 warps across all DTypes tested. For FWD, 7 warps gives a performance boost to fp16/fp32, but regresses on bf16, so I kept 4 warps for bf16 for the h=7168 config. See 3d22d82

I did not see the same performance boost for RMSNorm forward, however, so I left warps=4 there.

alextmagro · 2026-05-16T00:57:27Z

-                         (uint64_t(NormStage)) << 22 | (uint64_t(NormBackend) << 24) |
-                         (uint64_t(zero_centered_gamma) << 26) | (uint64_t(mode) << 27) |
-                         (uint64_t(training) << 37) | (uint64_t(gamma_in_weight_dtype) << 38);
+  uint64_t general_key =


I get the motivation behind this change, but this affects upstream code. I feel like we're more likely to miss a key change from upstream if we have diverged here.

Good point. I think I overcorrected this by refactoring the key layout into shared named constants, which does increase divergence from upstream and could make future key-layout changes easier to miss during syncs.

I've reverted the get_key() refactor back to the original upstream-style encoding layout and instead added explicit comments at both the encoding and decode sites documenting that the bit layouts must stay in sync. See 0949b9a

Why not add a static assert at the decode so the code doesn't compile for us if upstream changes the encoding?

I'm not sure how to accomplish this without modifying the get_key definition itself (e.g. introducing a shared layout definition/helper used by both get_key and the decode helpers). Otherwise the decode functions only have their own hardcoded assumptions about the bit layout and cannot independently detect drift in get_key.

Right... We'd have to make encoder a constexpr and move to header. That would be ideal, but ends up with more upstream changes anyway.
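For reference, a minimal sketch of the compile-time variant discussed here, assuming the encoder can be made `constexpr` and is visible where the decoders are defined. The shift values 22 and 24 mirror the stage and backend offsets in the quoted diff; the 2-bit field width, the enum definitions and the function names are assumptions for illustration and are not the library's code.

```cpp
#include <cstdint>

namespace key_check_sketch {

// Stand-ins for the real enums; the numeric values are hypothetical.
enum class Stage : uint64_t { Forward = 0, Backward = 1 };
enum class Backend : uint64_t { Te = 0, Cudnn = 1 };

constexpr unsigned kStageShift = 22;    // offset seen in the quoted diff
constexpr unsigned kBackendShift = 24;  // offset seen in the quoted diff
constexpr uint64_t kTwoBitMask = 0x3;   // assumed field width of 2 bits

// constexpr encoder: usable at compile time as well as at run time.
constexpr uint64_t encode(Stage s, Backend b) {
  return (static_cast<uint64_t>(s) << kStageShift) |
         (static_cast<uint64_t>(b) << kBackendShift);
}

constexpr Stage decode_stage(uint64_t key) {
  return static_cast<Stage>((key >> kStageShift) & kTwoBitMask);
}

constexpr Backend decode_backend(uint64_t key) {
  return static_cast<Backend>((key >> kBackendShift) & kTwoBitMask);
}

// If the encoder's layout drifts from the decoders, these stop compiling.
static_assert(decode_stage(encode(Stage::Backward, Backend::Cudnn)) == Stage::Backward,
              "stage decode is out of sync with encode");
static_assert(decode_backend(encode(Stage::Backward, Backend::Cudnn)) == Backend::Cudnn,
              "backend decode is out of sync with encode");

}  // namespace key_check_sketch
```

The trade-off is the one raised in the thread: the check is free at run time, but it requires touching the encoder's definition, which widens the diff against upstream.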

Actually, could we add a runtime check instead of a static assert that verifies the round trip is valid? Maybe something like this:

namespace {
[[maybe_unused]] const bool kNormKeyLayoutCheck = [] {
  auto [key, b, h, t] = get_key(
      NVTE_Norm_Backend::Te, NVTE_Norm_Type::RMSNorm, NVTE_Norm_Stage::Forward,
      DType::kFloat16, DType::kBFloat16, DType::kFloat8E4M3, DType::kFloat32,
      1, 1, false, false);
  NVTE_CHECK(decode_itype(key) == DType::kBFloat16);
  NVTE_CHECK(decode_otype(key) == DType::kFloat8E4M3);
  NVTE_CHECK(decode_ctype(key) == DType::kFloat32);
  NVTE_CHECK(decode_wtype(key) == DType::kFloat16);
  NVTE_CHECK(decode_norm_type(key) == NVTE_Norm_Type::RMSNorm);
  return true;
}();
}

This is helpful, thanks for the suggestion. I've gone ahead and implemented this in db7b017

alextmagro

LGTM! After merge, please work with Sudharshan to run the E2E configs again and get updated performance numbers

aris134 added 2 commits May 11, 2026 14:53

add rmsnorm perf benchmark

f639c6e

add rmsnorm perf benchmark and missing tuned DS configs

b5720e9

aris134 requested a review from alextmagro May 12, 2026 12:13

aris134 self-assigned this May 12, 2026

aris134 added 2 commits May 12, 2026 12:38

add missing tuned shape for Qwen and update benchmark

6c2cd28

move benchmark to benchmarks folder and rewrite in google benchmark style

912c62b

aris134 marked this pull request as ready for review May 12, 2026 19:15

aris134 requested review from ipanfilo, wangye805 and wenchenvincent as code owners May 12, 2026 19:15

aris134 added 5 commits May 12, 2026 19:44

add matching tuned configs for layernorm

c24a091

add fallback warning print if tuned config not found for normalization

3d8e1de

add fallback warning message when tuned normalization kernel config not found

856346d

uncomment qwen configs

d5293cb

generalization rms norm benchmark to also include layer norm, and add missing configs for layer norm

78f9aa5

ipanfilo reviewed May 15, 2026

View reviewed changes

aris134 added 3 commits May 15, 2026 16:00

remove redundant synchronization before gpu warmup in bench_normalization.cpp

4256e3c

address nit: move unordered_set to after unordered_map

2f9ff47

share normalization key bit layout constants

d548d54

aris134 requested a review from ipanfilo May 15, 2026 16:40

alextmagro requested changes May 16, 2026

View reviewed changes

aris134 added 5 commits May 18, 2026 13:30

remove unneeded line splits and deduplicate norm benchmark epsilon

b73062c

restore norm key encoding and document decode coupling

0949b9a

use more optimal BYTES_PER_LDG=16 for layer norm backward tuned config for H=7168

e8afcc8

use more optimal 7 warps config for H=7168, for fp16/fp32 layernorm fwd tuned

3d22d82

mirror layernorm bwd h=7168 tuned config to rmsnorm

63d0e26

aris134 requested a review from alextmagro May 18, 2026 16:57

alextmagro approved these changes May 18, 2026

View reviewed changes

aris134 added 2 commits May 18, 2026 19:23

Add normalization key layout round-trip check

db7b017

Merge remote-tracking branch 'origin/dev' into amartin/rmsnorm

62d38e1

aris134 merged commit 5cb098b into dev May 18, 2026
2 checks passed
