[Codegen][CPU] Fill in the bf16 and i8 ukernel bodies + e2e tests. #24572

Open
bjacob wants to merge 1 commit into users/bjacob/cpu-ukernel-pipeline-test from users/bjacob/cpu-ukernel-bodies

Conversation

@bjacob bjacob commented Jun 4, 2026

Collaborator

Replaces the bf16 and i8-VNNI seeds' stub bodies with real SIMD implementations, both generic over the intrinsics_{m,n,k} unrolling factors and structured like the AMDGPU C ukernel they were adapted from (iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8): accumulators in registers, an outer loop over the K tiles (k_outer), and inside it the (intrinsics_m, intrinsics_n, intrinsics_k) unroll. The intrinsics_* arrive as constants at the inlined call site, so the loops fully unroll and the acc_regs arrays become fixed register files -- the bitcode-LTO equivalent of a C++ template, as the README describes.
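The loop structure described above can be sketched in portable scalar C. This is only an illustration, not the ukernel from this PR: the function names, the operand layouts, and the `MAX_ACC_TILES` bound are assumptions made for the example, and `mma_1x16x2` is a scalar stand-in for the 1x16x2 bf16 tile rather than a real SIMD intrinsic (it also does not model the exact rounding of the hardware instruction).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical tile shape and register-file bound, chosen for this sketch. */
enum { TILE_N = 16, TILE_K = 2, MAX_ACC_TILES = 16 };

/* bf16 is the upper 16 bits of an IEEE-754 binary32 value. */
static float bf16_to_f32(uint16_t b) {
  uint32_t bits = (uint32_t)b << 16;
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Scalar stand-in for one 1x16x2 step: acc[n] += sum_k lhs[k] * rhs[n][k].
 * The LHS K-pair is shared by all 16 lanes, mirroring a broadcast. */
static void mma_1x16x2(float *acc, const uint16_t *lhs, const uint16_t *rhs) {
  for (int n = 0; n < TILE_N; ++n)
    for (int k = 0; k < TILE_K; ++k)
      acc[n] += bf16_to_f32(lhs[k]) * bf16_to_f32(rhs[n * TILE_K + k]);
}

/* Assumed layouts (illustrative only):
 *   lhs: [k_outer][intrinsics_m][intrinsics_k][TILE_K]
 *   rhs: [k_outer][intrinsics_n][intrinsics_k][TILE_N][TILE_K]
 *   acc: [intrinsics_m][intrinsics_n][TILE_N]
 * The caller must keep intrinsics_m * intrinsics_n <= MAX_ACC_TILES. */
static void sketch_ukernel_bf16(const uint16_t *lhs, const uint16_t *rhs,
                                float *acc, int k_outer, int intrinsics_m,
                                int intrinsics_n, int intrinsics_k) {
  const int tiles = intrinsics_m * intrinsics_n;
  if (tiles > MAX_ACC_TILES) return;

  /* Accumulators live in a local array for the whole K loop. When the
   * intrinsics_* values are compile-time constants after inlining, a
   * compiler can fully unroll the loops below and keep this in registers. */
  float acc_regs[MAX_ACC_TILES][TILE_N];
  memcpy(acc_regs, acc, (size_t)tiles * TILE_N * sizeof(float));

  for (int ko = 0; ko < k_outer; ++ko) {
    for (int m = 0; m < intrinsics_m; ++m) {
      for (int n = 0; n < intrinsics_n; ++n) {
        for (int k = 0; k < intrinsics_k; ++k) {
          const uint16_t *l =
              lhs + ((ko * intrinsics_m + m) * intrinsics_k + k) * TILE_K;
          const uint16_t *r =
              rhs + ((ko * intrinsics_n + n) * intrinsics_k + k) * TILE_N * TILE_K;
          mma_1x16x2(acc_regs[m * intrinsics_n + n], l, r);
        }
      }
    }
  }

  /* Single store of the accumulators after the last K tile. */
  memcpy(acc, acc_regs, (size_t)tiles * TILE_N * sizeof(float));
}
```

Loading the accumulators once before the K loop and storing them once after it is the point of the structure: memory traffic for the ACC operand does not grow with `k_outer`.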

  • bf16 (MMA_X86_AVX512BF16_1x16x2_F32_BF16): one _mm512_dpbf16_ps per (m, n, k), with the LHS K-pair broadcast via set1_ps.
  • i8 (MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16): the 16x16x2 tile is bit-compatible with the codegen path lowerX86Avx512Vnni16x16x2I8 -- one vpmovsxbw widen of each i8 panel to i16, the vpshufd / vbroadcasti32x4 fan-out, and 16 vpdpwssd over the block-interleaved (rlo, chi, rhi, clo) ACC layout. The i8 ukernel needs -mavx512bw for the widen, so it is added to the VNNI copts.
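As a rough scalar model of the i8 path's arithmetic, the sketch below widens each i8 operand to i16 and then accumulates pairs of i16 products into i32, which is what a non-saturating i16 pair dot-product does per lane. The function names are hypothetical, and the operands use plain row-major layouts; the block-interleaved (rlo, chi, rhi, clo) ACC layout and the shuffle/broadcast fan-out of the real kernel are deliberately not modeled here.

```c
#include <stdint.h>

/* 16x16x2 tile shape taken from the intrinsic name; layouts are assumed. */
enum { VNNI_M = 16, VNNI_N = 16, VNNI_K = 2 };

/* Scalar model of one i32 lane: acc += a[0]*b[0] + a[1]*b[1] on i16 inputs. */
static int32_t dot2_i16_lane(int32_t acc, const int16_t *a, const int16_t *b) {
  return acc + (int32_t)a[0] * (int32_t)b[0] + (int32_t)a[1] * (int32_t)b[1];
}

/* acc[m][n] += sum_k lhs[m][k] * rhs[n][k], with
 *   lhs: [VNNI_M][VNNI_K] i8, rhs: [VNNI_N][VNNI_K] i8, acc: [VNNI_M][VNNI_N] i32.
 * Accumulator values are assumed small enough not to overflow int32_t. */
static void sketch_mma_16x16x2_i8(int32_t *acc, const int8_t *lhs,
                                  const int8_t *rhs) {
  int16_t lhs16[VNNI_M * VNNI_K];
  int16_t rhs16[VNNI_N * VNNI_K];

  /* One sign-extending widen per panel, done before any multiply. */
  for (int i = 0; i < VNNI_M * VNNI_K; ++i) lhs16[i] = (int16_t)lhs[i];
  for (int i = 0; i < VNNI_N * VNNI_K; ++i) rhs16[i] = (int16_t)rhs[i];

  for (int m = 0; m < VNNI_M; ++m)
    for (int n = 0; n < VNNI_N; ++n)
      acc[m * VNNI_N + n] = dot2_i16_lane(acc[m * VNNI_N + n],
                                          &lhs16[m * VNNI_K],
                                          &rhs16[n * VNNI_K]);
}
```

Widening first matters for correctness at the extremes: (-128) * (-128) = 16384 is representable as an i16-by-i16 product accumulated into i32, and the sum of two such products, 32768, already exceeds the i16 range.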

LLVMCPUSelectUKernels now selects a ukernel only when its bitcode actually exists (via the bool returned by attachUKernelBitcodeOnOp). An MMAIntrinsic that the cost model picks but for which no seed exists -- e.g. the M<->N-swapped MMA_X86_AVX512BF16_16x1x2_F32_BF16 -- therefore falls back to codegen instead of leaving a dangling undefined symbol.
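The select-only-if-bitcode-exists rule can be illustrated with a small C sketch. Everything here is hypothetical: `try_attach_bitcode`, `select_lowering`, and the string table are stand-ins invented for the example and do not reflect the actual C++ pass or the signature of attachUKernelBitcodeOnOp. Only the two seeded intrinsic names and the unseeded swapped variant come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum { LOWER_WITH_CODEGEN, LOWER_WITH_UKERNEL } lowering_t;

/* Illustrative list of intrinsics that have a ukernel seed. */
static const char *const kSeededIntrinsics[] = {
    "MMA_X86_AVX512BF16_1x16x2_F32_BF16",
    "MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16",
};

/* Hypothetical stand-in for the attach step: reports whether bitcode for
 * the requested intrinsic was found (and, in a real pass, attached). */
static bool try_attach_bitcode(const char *intrinsic) {
  const size_t count = sizeof kSeededIntrinsics / sizeof kSeededIntrinsics[0];
  for (size_t i = 0; i < count; ++i)
    if (strcmp(kSeededIntrinsics[i], intrinsic) == 0) return true;
  return false;
}

/* Commit to the ukernel only when the attach step succeeded; otherwise keep
 * the op on the codegen path so no undefined symbol is ever referenced. */
static lowering_t select_lowering(const char *intrinsic) {
  return try_attach_bitcode(intrinsic) ? LOWER_WITH_UKERNEL
                                       : LOWER_WITH_CODEGEN;
}
```

The design point is that the decision and the attachment happen together: the caller never records "use a ukernel" without also knowing that the symbol will resolve at link time.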

Adds two execution/numerical tests, the first for the new C-bitcode ukernel path: e2e_matmul_cpu_dt_inner_tiled_llvm_ukernel_bf16_f32 (avx512bf16) and ..._i8_i32 (avx512vnni). Each compiles a data-tiled matmul with --iree-llvmcpu-enable-llvm-ukernels=inner_tiled, links the ukernel bitcode, runs on the host, and checks the results against a reference. This exercises the operand threading and the generic intrinsics_{m,n,k} unrolling that the IR-level lit tests cannot cover. Both tests were confirmed to actually select their ukernel rather than silently falling back to codegen.
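For readers unfamiliar with this kind of numerical test, the idea of checking against a reference can be sketched as follows for the i8 -> i32 case. This is not the e2e test harness used by the PR; `reference_matmul_i8_i32` and `count_mismatches_i32` are hypothetical helpers, and plain row-major operands are assumed. Integer results can be compared exactly, whereas a bf16 -> f32 check would normally need a tolerance.

```c
#include <stdint.h>

/* Naive reference: out[i][j] = sum_p lhs[i][p] * rhs[p][j], row-major,
 * with i8 inputs and i32 accumulation. */
static void reference_matmul_i8_i32(const int8_t *lhs, const int8_t *rhs,
                                    int32_t *out, int m, int n, int k) {
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      int32_t sum = 0;
      for (int p = 0; p < k; ++p)
        sum += (int32_t)lhs[i * k + p] * (int32_t)rhs[p * n + j];
      out[i * n + j] = sum;
    }
  }
}

/* Exact comparison; returns the number of differing elements. */
static int count_mismatches_i32(const int32_t *actual, const int32_t *expected,
                                int count) {
  int mismatches = 0;
  for (int i = 0; i < count; ++i)
    if (actual[i] != expected[i]) ++mismatches;
  return mismatches;
}
```

A test built this way would run the compiled matmul, run the reference on the same inputs, and require zero mismatches.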

Progress towards #24574.

@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 0907e85 to 121f292 Compare June 5, 2026 15:17
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 8af2850 to cb00012 Compare June 5, 2026 15:17
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 121f292 to d829f18 Compare June 5, 2026 15:19
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from cb00012 to 6cd0524 Compare June 5, 2026 15:19
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from d829f18 to 6dbb0eb Compare June 5, 2026 18:53
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6cd0524 to fc522fd Compare June 5, 2026 18:53
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 6dbb0eb to a5ed591 Compare June 8, 2026 14:25
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from fc522fd to 5fb2905 Compare June 8, 2026 14:25
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from a5ed591 to e72c8e2 Compare June 8, 2026 14:33
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 5fb2905 to 6ec426f Compare June 8, 2026 14:33
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from e72c8e2 to bdc6f02 Compare June 8, 2026 15:11
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6ec426f to 6602e64 Compare June 8, 2026 15:11
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from bdc6f02 to e576a0c Compare June 11, 2026 15:23
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch 2 times, most recently from 18678c3 to a9f88d9 Compare June 11, 2026 16:27
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from e576a0c to 2209e59 Compare June 11, 2026 16:27
@bjacob bjacob marked this pull request as ready for review June 11, 2026 18:45
@bjacob bjacob requested a review from hanhanW as a code owner June 11, 2026 18:45
@bjacob bjacob requested a review from egebeysel June 11, 2026 18:46
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 2209e59 to 189618e Compare June 16, 2026 20:07
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from a9f88d9 to b60f897 Compare June 16, 2026 20:07
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 189618e to 7422d39 Compare June 17, 2026 13:40
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from b60f897 to 6c0ae40 Compare June 17, 2026 13:40
Replaces the bf16 and i8-VNNI seeds' stub bodies with real SIMD
implementations, both generic over the `intrinsics_{m,n,k}` unrolling
factors and structured like the AMDGPU C ukernel they were adapted from
(`iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8`): accumulators in
registers, an outer loop over the K tiles (`k_outer`), and inside it the
`(intrinsics_m, intrinsics_n, intrinsics_k)` unroll. The `intrinsics_*`
arrive as constants at the inlined call site, so the loops fully unroll
and the `acc_regs` arrays become fixed register files -- the bitcode-LTO
equivalent of a C++ template, as the README describes.

  - bf16 (`MMA_X86_AVX512BF16_1x16x2_F32_BF16`): one `_mm512_dpbf16_ps`
    per (m, n, k), with the LHS K-pair broadcast via `set1_ps`.
  - i8 (`MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16`): the 16x16x2 tile is
    bit-compatible with the codegen path `lowerX86Avx512Vnni16x16x2I8` --
    one `vpmovsxbw` widen of each i8 panel to i16, the `vpshufd` /
    `vbroadcasti32x4` fan-out, and 16 `vpdpwssd` over the block-interleaved
    (rlo, chi, rhi, clo) ACC layout. The i8 ukernel needs `-mavx512bw` for
    the widen, so it is added to the VNNI copts.

`LLVMCPUSelectUKernels` now only selects a ukernel when its bitcode
actually exists (via `attachUKernelBitcodeOnOp`'s bool return), so an
`MMAIntrinsic` the cost model picks but for which no seed exists -- e.g.
the M<->N-swapped `MMA_X86_AVX512BF16_16x1x2_F32_BF16` -- falls back to
codegen instead of dangling an undefined symbol.

Adds two execution/numerical tests, the first of the new C-bitcode
ukernel path: `e2e_matmul_cpu_dt_inner_tiled_llvm_ukernel_bf16_f32`
(avx512bf16) and `..._i8_i32` (avx512vnni). Each compiles a data-tiled
matmul with `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled`, links the
ukernel bitcode, runs on host and checks results against a reference --
exercising the operand threading and generic `intrinsics_{m,n,k}`
unrolling that the IR-level lit tests cannot. Both were confirmed to
actually select their ukernel (not silently fall back to codegen).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 7422d39 to 5cd7454 Compare June 17, 2026 16:03
@bjacob bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6c0ae40 to 62ad0cd Compare June 17, 2026 16:03