Tracking: LLVMCPU C-microkernel framework #24574

@bjacob

C-based microkernels (ukernels) on the LLVMCPU backend: plain C functions
compiled to LLVM bitcode, embedded in iree-compile, and copied into the IR as
hal.executable_object attributes on a dispatch's HAL executable variant. This
is the same representation as the GPU C ukernels under
compiler/plugins/target/ROCM/builtins/ukernel/.

This is not the legacy CPU ukernel framework under
runtime/src/iree/builtins/ukernel/; the two coexist and this is purely
additive. Key differences: it lives in compiler/ (llvm-cpu-only; the VMVX
consumer that justified runtime/ is defunct), is built only as bitcode,
and exposes a lower-level interface — the inner K-loop of one data-tiled MMA
tile rather than a whole-matmul library.

Motivation / scope / differences from the legacy CPU ukernels.

| Framework | Legacy ukernels | These new ukernels |
| --- | --- | --- |
| Location in IREE tree | runtime/src/iree/builtins/ukernel/ | compiler/plugins/target/LLVMCPU/builtins/ukernel/ |
| Usage assumption | Designed as the majority use case for data-tiled matmuls on CPU, because codegen was generally not able to lower mmt4d to intrinsics. | Designed as a minority use case for the new (inner-tiled) data-tiled matmuls on CPU, because codegen is able to lower inner_tiled to intrinsics. |
| Usage | LLVMCPU, VMVX and standalone tests. | LLVMCPU only. |
| Build | Compiled twice: as LLVM bitcode, and as binary code using the native toolchain (for VMVX and for standalone tests). | Compiled once, as LLVM bitcode. |
| MLIR representation | The ukernel implementation is an external reference, opaque to MLIR code. | The ukernel implementation is pulled into a HAL executable_object, making the MLIR self-contained. |
| How to bring your own ukernels | Fork the IREE source tree. | Add your own ukernels as HAL executable_object in your source MLIR. |
| Call overhead and impact on design | Because of the dual build (bitcode / native) and the multiple purposes (LLVMCPU / VMVX / tests), it was unclear to what extent ukernel dispatch would turn out to have overhead that needs to be amortized. Even when it became clear that in LLVMCPU the ukernels were inlined into callers, the design was already set in place. | Designed from scratch around being always bitcode, always inlined and specialized into the caller: seamless, no overhead. |
| Granularity | Coarser-granularity ops, e.g. the mmt4d ukernel has outer loops on M/N. | Finer-granularity ops, e.g. the inner_tiled ukernels only have the innermost loop on K. |
| Genericity | Because of the binary build, all variants of a ukernel (e.g. narrow matmul shapes) have to be instantiated in the ukernel sources. | Because of the purely bitcode build and late inlining/specialization into the caller, ukernels can take additional parameters that specialize them when callers pass constant values (e.g. narrow matmul ukernels largely become a non-issue as long as they are simple truncations of the general tile). |
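
To make the "Genericity" row concrete, here is a minimal sketch of how an extra parameter can specialize an always-inlined ukernel. Everything in it is hypothetical and chosen for illustration only: the function names, the 8x8 tile shape, the f32 element type, and the operand layouts are assumptions, not the actual ukernel interface.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical general tile shape; real tile shapes depend on the target.
enum { TILE_M = 8, TILE_N = 8 };

// Accumulates acc[m][n] += lhs[k][m] * rhs[k][n] over k, for the first
// m_rows rows of the tile only. Assumed layouts: lhs is [k_size][m_rows],
// rhs is [k_size][TILE_N], acc is [TILE_M][TILE_N], all row-major.
// When m_rows is a constant at the call site and the call is inlined, the
// compiler is free to unroll or delete the m-loop for that caller.
static inline __attribute__((always_inline)) void sketch_mma_tile_f32(
    const float *lhs, const float *rhs, float *acc, int32_t k_size,
    int32_t m_rows) {
  assert(m_rows >= 1 && m_rows <= TILE_M);
  for (int32_t k = 0; k < k_size; ++k) {
    for (int32_t m = 0; m < m_rows; ++m) {
      for (int32_t n = 0; n < TILE_N; ++n) {
        acc[m * TILE_N + n] += lhs[k * m_rows + m] * rhs[k * TILE_N + n];
      }
    }
  }
}

// A narrow (M = 1) caller is just the general tile truncated to one row.
// No separate narrow implementation has to exist in the ukernel sources.
static void sketch_narrow_m1_f32(const float *lhs, const float *rhs,
                                 float *acc, int32_t k_size) {
  sketch_mma_tile_f32(lhs, rhs, acc, k_size, /*m_rows=*/1);
}
```

With a native binary build, `sketch_narrow_m1_f32` would have to be compiled ahead of time as its own variant; with bitcode inlined into the caller, the constant `m_rows` is visible to the optimizer at the point of use.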

The initial (and so far only) consumer is iree_codegen.inner_tiled with a
#iree_cpu.data_tiled_mma_layout. There, a ukernel implements the inner K loop
of one data-tiled MMA tile.
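
As a rough illustration of what "the inner K loop of one data-tiled MMA tile" can look like in plain C, here is a hedged sketch for an i8 x i8 -> i32 tile. The function name, the M0/N0/K0 tile shape, and the operand layouts are all assumptions made for this example; the real signature is defined by the ukernel sources, not by this sketch.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical inner tile shape (M0 x N0 accumulator, K0 elements per step).
enum { M0 = 4, N0 = 4, K0 = 2 };

// Inner K loop for one tile. Assumed layouts (illustrative only):
//   lhs: [k_size][M0][K0] int8, rhs: [k_size][N0][K0] int8,
//   acc: [M0][N0] int32, read and updated in place.
// There are no outer loops over M or N: the caller owns those.
static inline void sketch_inner_k_loop_i8i8i32(const int8_t *lhs,
                                               const int8_t *rhs,
                                               int32_t *acc,
                                               int32_t k_size) {
  assert(k_size >= 0);
  for (int32_t k = 0; k < k_size; ++k) {
    for (int m = 0; m < M0; ++m) {
      for (int n = 0; n < N0; ++n) {
        int32_t sum = 0;
        for (int kk = 0; kk < K0; ++kk) {
          sum += (int32_t)lhs[m * K0 + kk] * (int32_t)rhs[n * K0 + kk];
        }
        acc[m * N0 + n] += sum;
      }
    }
    // Advance both operands to the next K step of the same tile.
    lhs += M0 * K0;
    rhs += N0 * K0;
  }
}
```

A target-specific ukernel would typically replace the three innermost loops with the matching SIMD intrinsics while keeping the same outer K loop.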

PR queue (stacked, bottom → top)

Each PR is a single commit; its base is the previous branch, so each review
shows exactly one commit. Fill in PR numbers as they're opened.

Testing

  • Lit (compiler IR): self-contained tests that carry ukernel bitcode as a
    hal.executable_object literal, from single-pass checks
    (lower_inner_tiled_to_bitcode_ukernel*, select_ukernel) up to a
    full-codegen-pipeline check (e2e_inner_tiled_pipeline).
  • End-to-end numerical (tests/e2e/matmul): compile a data-tiled matmul
    with --iree-llvmcpu-enable-llvm-ukernels=inner_tiled, link the ukernel
    bitcode, run on host, and check against a reference — for both bf16→f32 and
    i8→i32 (each confirmed to actually select its ukernel, not silently fall back
    to codegen).
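
For the "check against a reference" step, the reference side for the i8→i32 case can be as simple as a naive triple loop. The sketch below is only an illustration of that idea, with hypothetical function names and row-major layouts assumed; it is not the actual harness under tests/e2e/matmul.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Naive reference: out[m][n] = sum_k lhs[m][k] * rhs[k][n], accumulating in
// int32. Assumed layouts: lhs is [m_size][k_size], rhs is [k_size][n_size],
// out is [m_size][n_size], all row-major.
static void sketch_reference_matmul_i8i8i32(int32_t m_size, int32_t n_size,
                                            int32_t k_size, const int8_t *lhs,
                                            const int8_t *rhs, int32_t *out) {
  assert(m_size >= 0 && n_size >= 0 && k_size >= 0);
  for (int32_t m = 0; m < m_size; ++m) {
    for (int32_t n = 0; n < n_size; ++n) {
      int32_t sum = 0;
      for (int32_t k = 0; k < k_size; ++k) {
        sum += (int32_t)lhs[m * k_size + k] * (int32_t)rhs[k * n_size + n];
      }
      out[m * n_size + n] = sum;
    }
  }
}

// Integer results must match exactly, so the check is a plain equality count.
static size_t sketch_count_mismatches_i32(const int32_t *actual,
                                          const int32_t *expected,
                                          size_t count) {
  size_t mismatches = 0;
  for (size_t i = 0; i < count; ++i) {
    if (actual[i] != expected[i]) ++mismatches;
  }
  return mismatches;
}
```

The bf16→f32 case would need a tolerance-based comparison instead of exact equality, since the accumulation order can differ between the ukernel and the reference.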

Metadata

Metadata

Assignees

Labels: codegen/llvm (LLVM code generation compiler backend)
