HipKittens MXFP8 GEMM Support by alextmagro · Pull Request #566 · ROCm/TransformerEngine

alextmagro · 2026-04-28T05:16:00Z

Creates an MXFP8 GEMM with HipKittens that outperforms hipBLASlt and offers additional epilogues such as BIAS and GELU AUX.

Requires a workspace sized relative to the model. This is often larger than the hipBLASlt workspace, but comes with significant performance improvements. Only builds for gfx950, and requires M and N to be multiples of 256.

Adds the hipKittens header library as a submodule.

wangye805 · 2026-05-01T15:25:11Z

            )
-        if use_bias:
-            pytest.skip("hipblaslt GEMM does not yet support MXFP8 with bias.")
+        hipkittens_eligible = (m % 256 == 0) and (n % 256 == 0) and (k >= 256)


same hardcoding 256s...
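
One way to address the hardcoded 256s is a named constant and a small helper. This is a minimal sketch rather than the code in the PR: the names `HIPKITTENS_TILE` and `hipkittens_eligible` are hypothetical, and the condition is copied from the diff line above.

```python
# Hypothetical constant name; the value 256 comes from the PR description.
HIPKITTENS_TILE = 256


def hipkittens_eligible(m: int, n: int, k: int, tile: int = HIPKITTENS_TILE) -> bool:
    """Mirror the test's condition: M and N are multiples of the tile, K is at least one tile."""
    return m % tile == 0 and n % tile == 0 and k >= tile
```

A test could then call `hipkittens_eligible(m, n, k)` wherever the literal condition is repeated, so the tile size is changed in one place.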

conflicts

ipanfilo · 2026-05-08T16:28:27Z

+  if (!use_mxfp8 && params.force_hipblaslt) {
+    GTEST_SKIP() << "force_hipblaslt only relevant for MXFP8";
+  }
+  if (use_mxfp8) {


Add a new const bool use_hipblaslt_fp8 = (!use_mxfp8 || params.force_hipblaslt); this combination is used below for many skips. All of this should also be moved below, under the ifdef HIP_PLATFORM_AMD, under has_fp8.

I wanted to avoid the skips completely, so split up the test instantiation into non-mxfp8 and mxfp8.

Nevertheless, the same condition is used multiple times below. Maybe you could instead have use_hipkittens_mxfp8 = (use_mxfp8 && !params.force_hipblaslt) for better clarity.

I see what you mean now. We can combine some of the checks to make it easier to read when we are using hipkittens.

ipanfilo · 2026-05-08T16:46:34Z

                         [](const testing::TestParamInfo<DqGEMMTestSuite::ParamType>& info) {
-                           return MKN(std::get<0>(info.param)) + "x" + TN(std::get<3>(info.param));
+                           return MKN(std::get<0>(info.param)) + "x" +
+                                  std::to_string(std::get<1>(info.param)) + "x" +


What is the point? They are only ever set to false.

ipanfilo · 2026-05-08T16:50:48Z

    GTEST_SKIP() << "MXFP8 is not supported in current config";
  }
+  if (params.use_bias || params.use_gelu) {
+    if (params.force_hipblaslt) {


It is skipped below anyway. If you are adding it for future use, move it after the more generic one.

Sorry, this and the Dq test name changes are artifacts from my attempt to enable bias and gelu for this test. I ran into issues with gelu for the non-fp8 GEMM in hipBLASlt, and decided to just focus on the non-Dq tests. I have reverted things.

ipanfilo · 2026-05-08T16:56:06Z

+#include <hip/hip_runtime.h>
+#include <cstddef>
+
+enum KittensDType {


Is it copied from some hipKittens enum? If so, add a comment.

These values come from the NVTE values -- I have added a comment to that effect.

And where are they used?

They are used at L735-750 in mxfp8_gemm.cpp. I have updated those functions to be a bit more defensive, too.

ipanfilo · 2026-05-08T17:15:28Z


-    return torch.empty(get_cublas_workspace_size_bytes(), dtype=torch.uint8, device=device)
+    key = (device, ub, grouped_gemm)
+    ws = _workspace_cache.get(key)


Why don't we rely on torch memory caching?

I have made this change. I will need to run an E2E run to make sure that performance isn't affected, but should be ok given my understanding of torch.empty()

aris134 · 2026-05-13T16:21:14Z

+    size_t sa_tr_bytes = align_up((size_t)M * scale_K, 256);
+    size_t sb_tr_bytes = align_up((size_t)N * scale_K, 256);
+    size_t sa_pk_bytes = align_up((size_t)k_iters * M * sizeof(uint32_t), 256);
+    size_t sb_pk_bytes = (size_t)k_iters * N * sizeof(uint32_t);


For my own understanding, can you explain why sb_pk_bytes does not require 256-alignment like the others?

Here, we are aligning the end of each variable so that the next address is 256 aligned, not the current one. Since sb_pk_bytes is the last address, we don't need to pad.
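
The padding rule described above can be sketched as follows. This is an illustration only, assuming the workspace base address is itself 256-byte aligned; `workspace_layout` is a hypothetical helper, and `scale_k` and `k_iters` are taken as plain inputs because their derivation is not shown in the diff.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def workspace_layout(m: int, n: int, scale_k: int, k_iters: int, alignment: int = 256):
    """Return (offsets, total_bytes) for the four sub-buffers named in the diff."""
    uint32_bytes = 4  # sizeof(uint32_t)
    # Unpadded sizes, in the same order as the diff.
    raw_sizes = [
        ("sa_tr", m * scale_k),
        ("sb_tr", n * scale_k),
        ("sa_pk", k_iters * m * uint32_bytes),
        ("sb_pk", k_iters * n * uint32_bytes),
    ]
    offsets = {}
    cursor = 0
    for index, (name, size) in enumerate(raw_sizes):
        offsets[name] = cursor
        is_last = index == len(raw_sizes) - 1
        # Pad the end of every buffer except the last, so that the
        # next buffer starts on an aligned offset.
        cursor += size if is_last else align_up(size, alignment)
    return offsets, cursor
```

With toy sizes such as `workspace_layout(10, 10, 5, 3)`, every offset is a multiple of 256 while the total is not, because nothing follows the last buffer.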

matthiasdiener · 2026-05-13T18:15:53Z

+
 namespace transformer_engine {
 namespace jax {


Nit: there are a few whitespace-only changes in these files, not sure if they are necessary.

I have removed this, thanks

aris134

LGTM!

ipanfilo · 2026-05-14T15:50:06Z

+            Path(__file__).resolve().parent.parent
+            / "3rdparty" / "hipkittens" / "include" / "kittens.cuh"
+        )
+        if "gfx950" in rocm_archs and hipkittens_header.exists():


The PyTorch/JAX extensions do not contain any GPU code; they delegate all of it to TE core, and hipKittens is added to TE common too.
Why is this build-time setting needed?

This is an artifact from when I was running into issues with CI not finding pybinded functions from hipKittens. The issue was elsewhere, and I forgot to remove this. I will remove it, thanks!

ipanfilo · 2026-05-14T23:02:31Z

    NVTE_CHECK((k % 128) == 0, "GEMM K dimension must be multiple of 128 for MXFP8 scaling (got K=", k, ")");
-    NVTE_CHECK((m % 16) == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");
-    NVTE_CHECK((n % 16) == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
+    NVTE_CHECK((m % 16)  == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");


This looks like just a spacing change. Please revert it if that is the case.

ipanfilo · 2026-05-14T23:05:52Z

-                 transb, grad, workspace, workspaceSize, alpha, beta, use_split_accumulator,
-                 math_sm_count, use_service_stream ? ss_ctl.stream : stream, handle);
+#ifdef USE_HIPKITTENS_GEMM
+  bool is_mxfp8 = inputA->scaling_mode == NVTE_MXFP8_1D_SCALING


Move it out of the ifdef and use it in the ifs that currently check the same condition.

ipanfilo · 2026-05-14T23:07:49Z

-    NVTE_CHECK((n % 16) == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
+    NVTE_CHECK((m % 16)  == 0, "GEMM M dimension must be multiple of 16 for MXFP8 scaling (got M=", m, ")");
+    NVTE_CHECK((n % 16)  == 0, "GEMM N dimension must be multiple of 16 for MXFP8 scaling (got N=", n, ")");
+#ifndef USE_HIPKITTENS_GEMM


It is checked below in the else branch of the hipkittens condition.

ipanfilo · 2026-05-14T23:09:35Z

+  if (use_hipkittens) {
+    auto param = CanonicalizeGemmInput(*inputA, transa, *inputB, transb, m, n, k);
+
+    hipStream_t s = use_service_stream ? ss_ctl.stream : stream;


Same as with is_mxfp8: there is no point in defining it for one branch only.

ipanfilo · 2026-05-14T23:16:58Z

  }

  auto [atol, rtol] = getTestTolerances(dtype, has_fp8, use_mxfp8);
+  size_t mismatch_limit = use_mxfp8 ? std::max((size_t)1, params.m * params.n / 1'000'000) : 0;


Unused variable

ipanfilo · 2026-05-15T00:21:26Z

@@ -743,12 +786,15 @@ MAKE_DQ_GEMM_TEST(Testfp8xfp8xfp16, fp8, fp8, fp16)

 INSTANTIATE_TEST_SUITE_P(OperatorTest, DqGEMMTestSuite,


If you end up having a separate prefix for MXFP8, it has to be used for this suite too, for consistency.

ipanfilo · 2026-05-15T00:37:41Z

@@ -30,7 +30,9 @@ std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes = {

 std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes_mxfp8 = {


test_case_sizes_mxfp8 is only used for DqGEMMTest. Is it intentional to add sizes there?

Yes, I wanted to add the minimum possible size that hipKittens supports, which is 256x256x256

ipanfilo · 2026-05-15T00:44:07Z

+  if (!use_mxfp8 && params.force_hipblaslt) {
+    GTEST_SKIP() << "force_hipblaslt only relevant for MXFP8";
+  }
+  if (use_mxfp8) {


Nevertheless, the same condition is used multiple times below. Maybe you could instead have use_hipkittens_mxfp8 = (use_mxfp8 && !params.force_hipblaslt) for better clarity.

ipanfilo · 2026-05-15T00:59:02Z

+#include <hip/hip_runtime.h>
+#include <cstddef>
+
+enum KittensDType {


And where are they used?

ipanfilo · 2026-05-15T01:08:06Z

 num_cublas_streams = get_num_compute_streams()


+def _hipkittens_workspace_bytes(m: int, n: int, k: int, layout: str) -> int:


Should it check for env to figure out if hipKittens is enabled?

Yes, I had the check for pytorch but forgot it for JAX, that is fixed now.
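
A minimal sketch of gating the workspace size on an environment check follows. The variable name `EXAMPLE_ENABLE_HIPKITTENS` is made up for illustration, since the real variable is not named in this thread. Taking the larger of the two sizes is also an assumption, loosely based on the "min workspace size guaranteed" commit.

```python
import os
from typing import Mapping

# Hypothetical environment variable name, for illustration only.
EXAMPLE_ENV_VAR = "EXAMPLE_ENABLE_HIPKITTENS"


def use_hipkittens(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the (hypothetical) opt-in variable is set to "1"."""
    return environ.get(EXAMPLE_ENV_VAR, "0") == "1"


def workspace_bytes(
    default_bytes: int,
    hipkittens_bytes: int,
    environ: Mapping[str, str] = os.environ,
) -> int:
    """Pick the workspace size; never go below the default size."""
    if not use_hipkittens(environ):
        return default_bytes
    # Assumption: the hipKittens path must keep at least the default size.
    return max(default_bytes, hipkittens_bytes)
```

Passing the environment as a parameter keeps the helper easy to test without mutating `os.environ`.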

ipanfilo · 2026-05-15T21:40:23Z

+
+    is_mxfp8 = isinstance(A, MXFP8TensorStorage) or isinstance(B, MXFP8TensorStorage)
+    if is_mxfp8 and _use_hipkittens():
+        a_size = A.size() if hasattr(A, "size") and callable(A.size) else A.shape


MXFP8TensorStorage has a callable size(). What other object could be here that would require this condition?

I was considering a scenario where A or B was not MXFP8, but we always have them both as MXFP8 so I think it is ok to simplify the logic

HipKittens MXFP8 GEMM Support

f9d5ce2

alextmagro requested review from aris134, matthiasdiener and zstreet87 April 28, 2026 05:16

alextmagro requested review from ipanfilo, wangye805 and wenchenvincent as code owners April 28, 2026 05:16

alextmagro added the ci-level 1 CI test level 1 label Apr 28, 2026

wangye805 requested changes May 1, 2026

alextmagro added 3 commits May 5, 2026 15:05

Update HipKittens branch after upstream MXFP8 merge

aac5860

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

c917ed0

Update HipKittens commit and address PR comments

3a91321

alextmagro requested a review from wangye805 May 5, 2026 20:26

alextmagro added 5 commits May 5, 2026 20:26

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8 with conflicts

cc719fe

Resolve conflicts, ensure fp4 workspace changes are harmonious

fcda154

min workspace size guaranteed

70fba6d

add hipkittens to wheels

455002e

fix issue with gfx942 for unified build

ba60ef5

aris134 reviewed May 6, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

ipanfilo requested changes May 8, 2026

alextmagro added 2 commits May 12, 2026 02:59

Cleanup and workspace changes

f72b7b8

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

731640a

alextmagro requested review from aris134 and ipanfilo May 12, 2026 13:24

alextmagro added 3 commits May 12, 2026 16:56

fix jax import issue

1960c06

Fix autotuning bug

320152e

fix pytorch import

a280cf7

Revert workspace changes to avoid sizing race condition

2a27902

aris134 reviewed May 13, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

Revert C++ workspace change to Python

3d7aaf9

aris134 reviewed May 13, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp

aris134 reviewed May 13, 2026

Comment thread transformer_engine/common/gemm/kittens/mxfp8_gemm.cpp Outdated

wangye805 approved these changes May 13, 2026

matthiasdiener reviewed May 13, 2026

aris134 approved these changes May 13, 2026

ipanfilo requested changes May 14, 2026

Cleanup style and build_tools relics

824841d

alextmagro requested a review from ipanfilo May 14, 2026 17:18

alextmagro added ci-level 3 CI test level 3 and removed ci-level 1 CI test level 1 labels May 14, 2026

matthiasdiener reviewed May 14, 2026

Comment thread transformer_engine/jax/cpp_extensions/gemm.py

Fix whitespaces and comment issues

f66f77c

ipanfilo reviewed May 15, 2026

alextmagro added 5 commits May 18, 2026 17:52

Kernel optimizations

0b6e702

Add use_hipkittens_mxfp8 bool to test_cublaslt_gemm.cu

816c752

rocm_gemm.cu cleanup

aaa88d7

Add env check to jax file

e2203c0

Simplify Workspace Check

7648594

alextmagro requested a review from ipanfilo May 18, 2026 20:43

alextmagro added 3 commits May 18, 2026 22:49

Revert kernel optimizations

03f675b

Merge remote-tracking branch 'origin/dev' into hipkittens_mxfp8

f852c22

Readd dropped test code

3b307bb
