Tracking: LLVMCPU C-microkernel framework

This issue tracks **C-based microkernels (ukernels) on the LLVMCPU backend**:
plain C functions compiled to LLVM bitcode, embedded in `iree-compile`, and
copied into the IR as `hal.executable_object` attributes on a dispatch's HAL
executable variant. This is the same representation used by the GPU C ukernels
under `compiler/plugins/target/ROCM/builtins/ukernel/`.

This is **not** the legacy CPU ukernel framework under
`runtime/src/iree/builtins/ukernel/`; the two coexist, and the new framework is
purely additive. Key differences: the new framework lives in `compiler/` (it is
llvm-cpu-only; the VMVX consumer that justified `runtime/` is defunct), is built
**only as bitcode**, and exposes a **lower-level interface**, namely the inner
K-loop of one data-tiled MMA tile rather than a whole-matmul library.

## Motivation / scope / differences from the legacy CPU ukernels

Framework | Legacy ukernels | These new ukernels
--- | --- | ---
Location in IREE tree | `runtime/src/iree/builtins/ukernel/` | `compiler/plugins/target/LLVMCPU/builtins/ukernel/`
Usage assumption | Expected to be the *majority* case for data-tiled matmuls on CPU, because codegen was generally *not* able to lower mmt4d to intrinsics. | Expected to be the *minority* case for the new (inner-tiled) data-tiled matmuls on CPU, because codegen *is* able to lower inner_tiled to intrinsics.
Usage | LLVMCPU, VMVX and standalone tests. | LLVMCPU only.
Build | Compiled *twice*: as LLVM bitcode and as binary code using the native toolchain (for VMVX and for standalone tests). | Compiled *once* as LLVM bitcode.
MLIR representation | Ukernel implementation is an external reference, opaque to MLIR code. | Ukernel implementation is pulled into a HAL executable_object, making the MLIR self-contained.
How to bring your own ukernels | Fork the IREE source tree. | Add your own ukernels as HAL executable_object in your source MLIR.
Call overhead and impact on design | Because of the dual build (bitcode / native) and multiple purposes (LLVMCPU/VMVX/tests), it was unclear how much ukernel dispatch overhead would need to be amortized. By the time it became clear that in LLVMCPU the ukernels were inlined into callers, the design was already set. | Ukernels designed from scratch around being always bitcode, always inlined and specialized into the caller: seamless, no overhead.
Granularity | Coarser-granularity ops, e.g. the mmt4d ukernel has outer loops on M/N. | Finer-granularity ops, e.g. the inner_tiled ukernels only have the innermost loop on K.
Genericity | Because of the binary build, all variants of a ukernel (e.g. narrow matmul shapes) have to be instantiated in the ukernel sources. | Because of the purely bitcode build and late inlining/specialization into the caller, ukernels can take additional parameters that specialize them when callers pass constant values (e.g. narrow matmul ukernels largely become a non-issue as long as they are simple truncations of the general tile).
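
The Genericity row can be illustrated with a minimal C sketch. The tile shape, the data layout, and the names `example_tile_f32` and `example_narrow_m1_caller` are hypothetical and not taken from the IREE sources. The point is only that an extra `m_rows` parameter becomes a compile-time constant once the kernel is inlined into a caller that passes a literal, so a narrow case does not need its own hand-written kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical general tile shape, for illustration only. */
enum { TILE_M = 8, TILE_N = 8 };

/*
 * General kernel: for each k, accumulates the outer product of one LHS
 * column (TILE_M values) and one RHS row (TILE_N values) into acc.
 * m_rows (1..TILE_M) truncates the tile to its first m_rows rows.
 */
static inline void example_tile_f32(const float *lhs, const float *rhs,
                                    float *acc, int64_t k_size, int m_rows) {
  assert(m_rows >= 1 && m_rows <= TILE_M);
  for (int64_t k = 0; k < k_size; ++k) {
    for (int m = 0; m < m_rows; ++m) {
      for (int n = 0; n < TILE_N; ++n) {
        acc[m * TILE_N + n] += lhs[k * TILE_M + m] * rhs[k * TILE_N + n];
      }
    }
  }
}

/*
 * Narrow caller: m_rows is the literal 1. Once the general kernel is
 * inlined here, the compiler can fold the m loop away.
 */
static inline void example_narrow_m1_caller(const float *lhs,
                                            const float *rhs, float *acc,
                                            int64_t k_size) {
  example_tile_f32(lhs, rhs, acc, k_size, /*m_rows=*/1);
}
```

In this sketch the narrow case is a plain truncation of the general tile, which is the situation the table describes as largely a non-issue.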

The initial (and so far only) consumer is **`iree_codegen.inner_tiled`** with a
`#iree_cpu.data_tiled_mma_layout`. There, a ukernel implements the inner K loop
of one data-tiled MMA tile.
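
As a rough picture of what "the inner K loop of one data-tiled MMA tile" means, here is a minimal C sketch for an i8 × i8 → i32 tile. The tile shape (`M0`, `N0`, `K0`), the memory layout, and the function name `example_mma_i8_i32_kloop` are assumptions made for illustration. They do not reflect the actual ukernel signatures or the layouts chosen by `#iree_cpu.data_tiled_mma_layout`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tile shape, chosen only for illustration. */
enum { M0 = 4, N0 = 4, K0 = 2 };

/*
 * Inner K loop for one data-tiled i8 x i8 -> i32 tile.
 *   lhs: k_size contiguous tiles, each laid out as [M0][K0].
 *   rhs: k_size contiguous tiles, each laid out as [N0][K0].
 *   acc: a single [M0][N0] accumulator tile, updated in place.
 * There are no outer loops over M or N: the caller owns those.
 */
static inline void example_mma_i8_i32_kloop(const int8_t *lhs,
                                            const int8_t *rhs, int32_t *acc,
                                            int64_t k_size) {
  assert(k_size >= 0);
  for (int64_t k = 0; k < k_size; ++k) {
    for (int m = 0; m < M0; ++m) {
      for (int n = 0; n < N0; ++n) {
        int32_t sum = 0;
        for (int k0 = 0; k0 < K0; ++k0) {
          sum += (int32_t)lhs[m * K0 + k0] * (int32_t)rhs[n * K0 + k0];
        }
        acc[m * N0 + n] += sum;
      }
    }
    /* Advance to the next tile along K. */
    lhs += M0 * K0;
    rhs += N0 * K0;
  }
}
```

A real ukernel would typically replace the scalar body with target intrinsics. The sketch keeps it scalar so the loop structure stays visible.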

## PR queue (stacked, bottom → top)

Each PR is a single commit; its base is the previous branch, so each review
shows exactly one commit. Fill in PR numbers as they're opened.

- #24586
- #24567
- #24568
- #24569
- #24570
- #24571
- #24572
- #24573

## Testing

- **Lit (compiler IR):** self-contained tests that carry ukernel bitcode as a
  `hal.executable_object` literal, from single-pass checks
  (`lower_inner_tiled_to_bitcode_ukernel*`, `select_ukernel`) up to a
  full-codegen-pipeline check (`e2e_inner_tiled_pipeline`).
- **End-to-end numerical (`tests/e2e/matmul`):** compile a data-tiled matmul
  with `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled`, link the ukernel
  bitcode, run on host, and check against a reference — for both bf16→f32 and
  i8→i32 (each confirmed to actually select its ukernel, not silently fall back
  to codegen).
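
The reference comparison mentioned in the end-to-end bullet can be pictured with a small C sketch for the bf16→f32 case. This is not the code in `tests/e2e/matmul`: the function names, the row-major layout, and the tolerance check are assumptions made for illustration. The only fact relied on is that a bf16 value is the upper 16 bits of an IEEE-754 binary32 value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen a bf16 bit pattern to f32 by placing it in the upper 16 bits. */
static inline float example_bf16_to_f32(uint16_t bits) {
  uint32_t u = (uint32_t)bits << 16;
  float f;
  memcpy(&f, &u, sizeof f);
  return f;
}

/*
 * Naive row-major reference: acc[m][n] += sum over k of lhs[m][k] * rhs[k][n],
 * with bf16 inputs widened to f32 before multiplying.
 */
static void example_reference_matmul_bf16_f32(const uint16_t *lhs,
                                              const uint16_t *rhs, float *acc,
                                              int m_size, int n_size,
                                              int k_size) {
  assert(m_size >= 0 && n_size >= 0 && k_size >= 0);
  for (int m = 0; m < m_size; ++m) {
    for (int n = 0; n < n_size; ++n) {
      float sum = acc[m * n_size + n];
      for (int k = 0; k < k_size; ++k) {
        sum += example_bf16_to_f32(lhs[m * k_size + k]) *
               example_bf16_to_f32(rhs[k * n_size + n]);
      }
      acc[m * n_size + n] = sum;
    }
  }
}

/* Returns 1 if every element differs from its reference by <= tolerance. */
static int example_all_close(const float *actual, const float *expected,
                             int count, float tolerance) {
  for (int i = 0; i < count; ++i) {
    float diff = actual[i] - expected[i];
    if (diff < 0.0f) diff = -diff;
    if (!(diff <= tolerance)) return 0;
  }
  return 1;
}
```

The output of the compiled module would be passed as `actual` and the reference result as `expected`. The choice of tolerance is left open here.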


Tracking: LLVMCPU C-microkernel framework #24574
