[Codegen][CPU] Fill in the bf16 and i8 ukernel bodies + e2e tests. by bjacob · Pull Request #24572 · iree-org/iree

bjacob · 2026-06-04T13:55:14Z

Replaces the bf16 and i8-VNNI seeds' stub bodies with real SIMD implementations, both generic over the intrinsics_{m,n,k} unrolling factors and structured like the AMDGPU C ukernel they were adapted from (iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8): accumulators in registers, an outer loop over the K tiles (k_outer), and inside it the (intrinsics_m, intrinsics_n, intrinsics_k) unroll. The intrinsics_* arrive as constants at the inlined call site, so the loops fully unroll and the acc_regs arrays become fixed register files -- the bitcode-LTO equivalent of a C++ template, as the README describes.

bf16 (MMA_X86_AVX512BF16_1x16x2_F32_BF16): one _mm512_dpbf16_ps per (m, n, k), with the LHS K-pair broadcast via set1_ps.
i8 (MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16): the 16x16x2 tile is bit-compatible with the codegen path lowerX86Avx512Vnni16x16x2I8 -- one vpmovsxbw widen of each i8 panel to i16, the vpshufd / vbroadcasti32x4 fan-out, and 16 vpdpwssd over the block-interleaved (rlo, chi, rhi, clo) ACC layout. The i8 ukernel needs -mavx512bw for the widen, so it is added to the VNNI copts.

LLVMCPUSelectUKernels now only selects a ukernel when its bitcode actually exists (via attachUKernelBitcodeOnOp's bool return), so an MMAIntrinsic the cost model picks but for which no seed exists -- e.g. the M<->N-swapped MMA_X86_AVX512BF16_16x1x2_F32_BF16 -- falls back to codegen instead of dangling an undefined symbol.

Adds two execution/numerical tests, the first of the new C-bitcode ukernel path: e2e_matmul_cpu_dt_inner_tiled_llvm_ukernel_bf16_f32 (avx512bf16) and ..._i8_i32 (avx512vnni). Each compiles a data-tiled matmul with --iree-llvmcpu-enable-llvm-ukernels=inner_tiled, links the ukernel bitcode, runs on host and checks results against a reference -- exercising the operand threading and generic intrinsics_{m,n,k} unrolling that the IR-level lit tests cannot. Both were confirmed to actually select their ukernel (not silently fall back to codegen).

Progress towards #24574.

Replaces the bf16 and i8-VNNI seeds' stub bodies with real SIMD implementations, both generic over the `intrinsics_{m,n,k}` unrolling factors and structured like the AMDGPU C ukernel they were adapted from (`iree_uk_amdgpu_multi_mma_mfma_i32_16x16x32_i8`): accumulators in registers, an outer loop over the K tiles (`k_outer`), and inside it the `(intrinsics_m, intrinsics_n, intrinsics_k)` unroll. The `intrinsics_*` arrive as constants at the inlined call site, so the loops fully unroll and the `acc_regs` arrays become fixed register files -- the bitcode-LTO equivalent of a C++ template, as the README describes. - bf16 (`MMA_X86_AVX512BF16_1x16x2_F32_BF16`): one `_mm512_dpbf16_ps` per (m, n, k), with the LHS K-pair broadcast via `set1_ps`. - i8 (`MMA_X86_AVX512VNNI_16x16x2_I32_I8_CASTI16`): the 16x16x2 tile is bit-compatible with the codegen path `lowerX86Avx512Vnni16x16x2I8` -- one `vpmovsxbw` widen of each i8 panel to i16, the `vpshufd` / `vbroadcasti32x4` fan-out, and 16 `vpdpwssd` over the block-interleaved (rlo, chi, rhi, clo) ACC layout. The i8 ukernel needs `-mavx512bw` for the widen, so it is added to the VNNI copts. `LLVMCPUSelectUKernels` now only selects a ukernel when its bitcode actually exists (via `attachUKernelBitcodeOnOp`'s bool return), so an `MMAIntrinsic` the cost model picks but for which no seed exists -- e.g. the M<->N-swapped `MMA_X86_AVX512BF16_16x1x2_F32_BF16` -- falls back to codegen instead of dangling an undefined symbol. Adds two execution/numerical tests, the first of the new C-bitcode ukernel path: `e2e_matmul_cpu_dt_inner_tiled_llvm_ukernel_bf16_f32` (avx512bf16) and `..._i8_i32` (avx512vnni). Each compiles a data-tiled matmul with `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled`, links the ukernel bitcode, runs on host and checks results against a reference -- exercising the operand threading and generic `intrinsics_{m,n,k}` unrolling that the IR-level lit tests cannot. Both were confirmed to actually select their ukernel (not silently fall back to codegen). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>

bjacob mentioned this pull request Jun 4, 2026

Tracking: LLVMCPU C-microkernel framework #24574

Open

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 0907e85 to 121f292 Compare June 5, 2026 15:17

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 8af2850 to cb00012 Compare June 5, 2026 15:17

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 121f292 to d829f18 Compare June 5, 2026 15:19

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from cb00012 to 6cd0524 Compare June 5, 2026 15:19

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from d829f18 to 6dbb0eb Compare June 5, 2026 18:53

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6cd0524 to fc522fd Compare June 5, 2026 18:53

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 6dbb0eb to a5ed591 Compare June 8, 2026 14:25

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from fc522fd to 5fb2905 Compare June 8, 2026 14:25

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from a5ed591 to e72c8e2 Compare June 8, 2026 14:33

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 5fb2905 to 6ec426f Compare June 8, 2026 14:33

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from e72c8e2 to bdc6f02 Compare June 8, 2026 15:11

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6ec426f to 6602e64 Compare June 8, 2026 15:11

bjacob mentioned this pull request Jun 11, 2026

[Codegen][CPU] Seed C-bitcode ukernel framework with bf16 and i8 seeds. #24567

Merged

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from bdc6f02 to e576a0c Compare June 11, 2026 15:23

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch 2 times, most recently from 18678c3 to a9f88d9 Compare June 11, 2026 16:27

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from e576a0c to 2209e59 Compare June 11, 2026 16:27

bjacob marked this pull request as ready for review June 11, 2026 18:45

bjacob requested a review from hanhanW as a code owner June 11, 2026 18:45

bjacob requested a review from egebeysel June 11, 2026 18:46

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 2209e59 to 189618e Compare June 16, 2026 20:07

bjacob requested a review from MaheshRavishankar as a code owner June 16, 2026 20:07

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from a9f88d9 to b60f897 Compare June 16, 2026 20:07

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 189618e to 7422d39 Compare June 17, 2026 13:40

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from b60f897 to 6c0ae40 Compare June 17, 2026 13:40

bjacob force-pushed the users/bjacob/cpu-ukernel-pipeline-test branch from 7422d39 to 5cd7454 Compare June 17, 2026 16:03

bjacob force-pushed the users/bjacob/cpu-ukernel-bodies branch from 6c0ae40 to 62ad0cd Compare June 17, 2026 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Codegen][CPU] Fill in the bf16 and i8 ukernel bodies + e2e tests.#24572

[Codegen][CPU] Fill in the bf16 and i8 ukernel bodies + e2e tests.#24572
bjacob wants to merge 1 commit into
users/bjacob/cpu-ukernel-pipeline-testfrom
users/bjacob/cpu-ukernel-bodies

bjacob commented Jun 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bjacob commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

bjacob commented Jun 4, 2026 •

edited

Loading