C-based microkernels (ukernels) on the
LLVMCPU backend: plain C functions compiled to LLVM bitcode, embedded in
iree-compile, and copied into the IR as hal.executable_object attributes on
a dispatch's HAL executable variant — the same representation as the GPU C
ukernels under compiler/plugins/target/ROCM/builtins/ukernel/.
This is not the legacy CPU ukernel framework under
runtime/src/iree/builtins/ukernel/; the two coexist and this is purely
additive. Key differences: it lives in compiler/ (llvm-cpu-only; the VMVX
consumer that justified runtime/ is defunct), is built only as bitcode,
and exposes a lower-level interface — the inner K-loop of one data-tiled MMA
tile rather than a whole-matmul library.
Motivation / scope / differences from the legacy CPU ukernels.
| Framework | Legacy ukernels | These new ukernels |
| --- | --- | --- |
| Location in IREE tree | runtime/src/iree/builtins/ukernel/ | compiler/plugins/target/LLVMCPU/builtins/ukernel/ |
| Usage assumption | Assumed to be the majority case for data-tiled matmuls on CPU, because codegen was generally unable to lower mmt4d to intrinsics. | Assumed to be the minority case for the new (inner-tiled) data-tiled matmuls on CPU, because codegen can lower inner_tiled to intrinsics. |
| Usage | LLVMCPU, VMVX and standalone tests. | LLVMCPU only. |
| Build | Compiled twice: as LLVM bitcode and as native code using the native toolchain (for VMVX and for standalone tests). | Compiled once, as LLVM bitcode. |
| MLIR representation | The ukernel implementation is an external reference, opaque to MLIR code. | The ukernel implementation is pulled into a HAL executable_object, making the MLIR self-contained. |
| How to bring your own ukernels | Fork the IREE source tree. | Add your own ukernels as HAL executable_object in your source MLIR. |
| Call overhead and impact on design | Because of the dual build (bitcode and native) and the multiple consumers (LLVMCPU, VMVX, tests), it was unclear how much overhead the ukernel dispatch would have and need to amortize. By the time it became clear that LLVMCPU inlines the ukernels into their callers, the design was already set. | Designed from scratch to always be bitcode, always inlined and specialized into the caller, with no call overhead. |
| Granularity | Coarser-granularity ops, e.g. the mmt4d ukernel has outer loops on M/N. | Finer-granularity ops, e.g. the inner_tiled ukernels only have the innermost loop on K. |
| Genericity | Because of the native build, all variants of a ukernel (e.g. narrow matmul shapes) have to be instantiated in the ukernel sources. | Because of the bitcode-only build and late inlining/specialization into the caller, ukernels can take additional parameters that specialize them when callers pass constant values (e.g. narrow matmul ukernels largely become a non-issue as long as they are simple truncations of the general tile). |
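To illustrate the genericity point, here is a minimal sketch of a parameter that specializes once a caller passes a constant. Everything in it is hypothetical: the function names, the 8x4 tile shape, the operand layout and the `m_rows` parameter are assumptions for illustration, not the interface used in this PR stack.

```c
#include <stdint.h>

// Hypothetical full tile shape, chosen only for this illustration.
enum { FULL_M = 8, FULL_N = 4 };

// General routine: accumulates into the first m_rows rows of a
// FULL_M x FULL_N row-major accumulator tile.
//   lhs: k_size blocks of FULL_M floats (K-major)
//   rhs: k_size blocks of FULL_N floats (K-major)
// Rows at or beyond m_rows are left untouched, so a narrow case is a plain
// truncation of the general tile.
static inline __attribute__((always_inline)) void example_tile_acc_f32(
    const float *lhs, const float *rhs, float *acc, int32_t k_size,
    int32_t m_rows) {
  for (int32_t k = 0; k < k_size; ++k) {
    for (int32_t m = 0; m < m_rows; ++m) {
      for (int32_t n = 0; n < FULL_N; ++n) {
        acc[m * FULL_N + n] += lhs[k * FULL_M + m] * rhs[k * FULL_N + n];
      }
    }
  }
}

// A caller for the narrow M=1 case. Because m_rows is a compile-time
// constant here and the callee is inlined, an optimizing compiler can fold
// the row loop to a single iteration; no separate narrow variant is written.
void example_caller_narrow_m1(const float *lhs, const float *rhs, float *acc,
                              int32_t k_size) {
  example_tile_acc_f32(lhs, rhs, acc, k_size, /*m_rows=*/1);
}
```

With a native (non-inlined) build, the same narrow case would typically need its own instantiated entry point, which is the contrast the table row draws.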
The initial (and so far only) consumer is iree_codegen.inner_tiled with a
#iree_cpu.data_tiled_mma_layout. There, a ukernel implements the inner K loop
of one data-tiled MMA tile:
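As a rough sketch of what such a ukernel could look like: the function name, the 4x4x1 tile shape, the operand layout and the signature below are all assumptions made for illustration, not the interface defined in this PR stack.

```c
#include <stdint.h>

// Hypothetical tile shape (M x N x K of one MMA step), for illustration only.
enum { TILE_M = 4, TILE_N = 4, TILE_K = 1 };

// Sketch of an inner-K-loop ukernel for one data-tiled tile, i8 x i8 -> i32.
//   lhs: k_tiles blocks of TILE_M * TILE_K values
//   rhs: k_tiles blocks of TILE_N * TILE_K values
//   acc: TILE_M x TILE_N row-major accumulators, updated in place
// There are no outer loops over M or N tiles; in this sketch those stay in
// the caller, and this function only walks the K dimension of one tile.
static inline __attribute__((always_inline)) void
example_inner_k_loop_i8i8i32(const int8_t *lhs, const int8_t *rhs,
                             int32_t *acc, int32_t k_tiles) {
  for (int32_t k = 0; k < k_tiles; ++k) {
    const int8_t *lhs_k = lhs + k * TILE_M * TILE_K;
    const int8_t *rhs_k = rhs + k * TILE_N * TILE_K;
    for (int32_t m = 0; m < TILE_M; ++m) {
      for (int32_t n = 0; n < TILE_N; ++n) {
        // TILE_K == 1 in this sketch, so each step is one outer product.
        acc[m * TILE_N + n] += (int32_t)lhs_k[m] * (int32_t)rhs_k[n];
      }
    }
  }
}
```

A real ukernel would presumably replace the two inner loops with target intrinsics; the plain loops here only show the shape of the computation.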
PR queue (stacked, bottom → top)
Each PR is a single commit; its base is the previous branch, so each review
shows exactly one commit. Fill in PR numbers as they're opened.
Testing
- Lit (compiler IR): self-contained tests that carry ukernel bitcode as a
hal.executable_object literal, from single-pass checks
(lower_inner_tiled_to_bitcode_ukernel*, select_ukernel) up to a
full-codegen-pipeline check (e2e_inner_tiled_pipeline).
- End-to-end numerical (tests/e2e/matmul): compile a data-tiled matmul
with --iree-llvmcpu-enable-llvm-ukernels=inner_tiled, link the ukernel
bitcode, run on host, and check against a reference — for both bf16→f32 and
i8→i32 (each confirmed to actually select its ukernel, not silently fall back
to codegen).