Skip to content

Commit f896f19

Browse files
bjacobclaude
andcommitted
[Codegen][CPU] Add design doc for the LLVMCPU C-ukernel framework.
Adds README.md for compiler/plugins/target/LLVMCPU/builtins/ukernel/ as a single final documentation commit, rather than iterating the doc across the earlier framework commits (less churn for review). Covers: scope (the framework is op-agnostic in principle, with `inner_tiled` as the initial and primary case getting the detailed treatment — inner-K-loop-only contract, and how specialization splits between symbol selection on the `MMAIntrinsic` and post-inlining loop unrolling on the `intrinsics_{m,n,k}` factors); the two structural pillars (lives in compiler/, bitcode-only; self-contained IR via `hal.executable_object`); the end-to-end lowering flow; the `--iree-llvmcpu-enable-llvm-ukernels` user flag; bring-your-own-ukernel; an authoring guide; testing at both the lit (compiler IR) and `tests/e2e/matmul` (numerical) levels; and the three seed examples. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Benoit Jacob <jacob.benoit.1@gmail.com>
1 parent b60f897 commit f896f19

1 file changed

Lines changed: 241 additions & 0 deletions

File tree

  • compiler/plugins/target/LLVMCPU/builtins/ukernel
Lines changed: 241 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,241 @@
1+
# LLVMCPU C microkernels
2+
3+
This directory holds **C-based microkernels** (ukernels) for the LLVMCPU
4+
target backend. They are compiled to LLVM bitcode, embedded as static data
5+
in `iree-compile`, and at compile time are copied into the IR as
6+
`hal.executable_object` attributes on the dispatch's HAL executable variant.
7+
That representation is the same as the GPU C ukernels under
8+
`compiler/plugins/target/ROCM/builtins/ukernel/`.
9+
10+
This is **not** the same framework as the legacy CPU ukernels under
11+
`runtime/src/iree/builtins/ukernel/` — see the design pillars below for the
12+
differences. The two coexist; this directory is purely additive.
13+
14+
## Scope
15+
16+
The framework itself is **not tied to any one op**. A ukernel is just a C
17+
function, compiled to bitcode, that codegen may substitute for some piece
18+
of IR it would otherwise lower itself. Any op that codegen knows how to
19+
match and replace with a `iree_codegen.ukernel.generic` call can have a
20+
ukernel; the build/embed/lookup/lowering machinery below is generic.
21+
22+
The **initial and primary** use case — and the only one wired up so far —
23+
is `iree_codegen.inner_tiled`. The rest of this document leans on that case
24+
for its concrete examples, but keep the distinction in mind: the pillars
25+
about *where ukernels live* and *how their bitcode reaches the IR* are
26+
general, while the specifics of *what a ukernel computes* (the inner K
27+
loop, the `intrinsics_{m,n,k}` contract) are particular to `inner_tiled`.
28+
29+
### The `inner_tiled` case
30+
31+
For an `iree_codegen.inner_tiled` op with a
32+
`#iree_cpu.data_tiled_mma_layout`, a ukernel implements the **inner K
33+
loop** of one data-tiled MMA tile. Outer M/N looping is handled by ordinary
34+
IREE tiling *before* the ukernel runs; the ukernel sees a single
35+
(M, N) tile and walks K itself.
36+
37+
A ukernel is specific to **one `MMAIntrinsic`** but **generic over the
38+
`intrinsics_{m,n,k}` unrolling factors**. Those factors are passed to the C
39+
function as ordinary parameters and look like runtime values inside the
40+
ukernel's translation unit — but they are *always* compile-time constants
41+
in the calling context (the caller passes the `DataTiledMMAAttr`'s
42+
constants), and the ukernel is *always* inlined into that caller (anything
43+
else is a bug). Linking the ukernel as bitcode into the dispatch lets the
44+
post-inlining optimizer fully specialize the body to each call site's
45+
concrete `(intrinsics_m, intrinsics_n, intrinsics_k)` triple: loops over
46+
those counts unroll, the unrolled tile lives in registers, and nothing of
47+
the "runtime parameter" framing survives. In effect the bitcode + inlining
48+
+ LTO chain gives C++-template-like specialization without the template
49+
syntax — specialization on the `MMAIntrinsic` happens via *symbol
50+
selection* (one ukernel function per intrinsic), and specialization on the
51+
unrolling factors happens via *post-inlining loop unrolling*.
52+
53+
In particular, this is a **lower-level interface than the legacy
54+
`mmt4d` ukernels**:
55+
56+
| | Legacy (`runtime/.../ukernel/`) | New (this directory) |
57+
|---|---|---|
58+
| What an entry point is | A whole-matmul library: handles arbitrary `mmt4d` shapes, walks outer M/N itself, internally dispatches to arch-variant inner kernels | The inner K-loop of one specific (intrinsic, arch-variant) configuration |
59+
| Outer M/N looping | Inside the ukernel | Outside, done by IREE tiling |
60+
| Per-intrinsic specialization | Hidden behind a dispatching front door | Exposed — one entry point per (intrinsic, arch-variant) |
61+
| Caller responsibility | Almost nothing | Tile to the ukernel's exact data-tiled shape |
62+
63+
## Design pillars
64+
65+
### 1. Lives in `compiler/`, built only as LLVM bitcode
66+
67+
Two structural decisions, both intentionally different from the legacy
68+
framework:
69+
70+
- **`compiler/`, not `runtime/`.** The legacy ukernels live under
71+
`runtime/src/iree/builtins/ukernel/` because at the time they served two
72+
consumers: the llvm-cpu backend *and* VMVX. The VMVX ukernel path is now
73+
effectively defunct, so the runtime-side / shared-with-VMVX
74+
justification no longer applies. The new ukernels are llvm-cpu-only and
75+
live in the plugin that uses them, mirroring the GPU C ukernels under
76+
`compiler/plugins/target/ROCM/builtins/ukernel/`.
77+
78+
- **Bitcode only.** The legacy ukernels are also built with the native
79+
toolchain, to be testable/benchmarkable in isolation via the runtime's
80+
`tools/` directory. That property turned out not to be worth the extra
81+
build complexity. The new ukernels are built only as bitcode, embedded
82+
as static data in `iree-compile`, and tested via the self-contained-IR
83+
property described next.
84+
85+
### 2. IR representation: self-contained via `hal.executable_object`
86+
87+
When `iree-compile` decides to use one of these built-in ukernels, the
88+
ukernel's bitcode is **copied into the IR** as a `hal.executable_object`
89+
attached to the dispatch's executable variant. From that point on the
90+
module is self-contained: everything needed to compile and run the
91+
dispatch is in the IR.
92+
93+
This is **different** from the legacy framework, where the ukernel was an
94+
opaque external symbol the runtime was expected to provide. Two
95+
consequences fall out:
96+
97+
- **Tests are trivially writable.** A lit test is just an MLIR file that
98+
carries its own `hal.executable_object`. The test owns its ukernel
99+
bitecode bytes outright — no runtime-side fixturing, no special CI
100+
wiring, no need to actually rerun the bitcode-library build for the test
101+
to run.
102+
- **Bring-your-own-ukernel is just `hal.executable_object`.** A user who
103+
wants to override or supply their own ukernel attaches their own
104+
`hal.executable_object` to their input MLIR. `iree-compile` honors it on
105+
equal footing with the built-in ones (and the lookup logic, when it
106+
walks up looking for executable objects, finds the user's one first).
107+
No fork of IREE, no runtime patching.
108+
109+
## How a built-in ukernel works end to end
110+
111+
This walks through the `inner_tiled` path; an analogous flow would apply to
112+
any future op the framework grows to cover.
113+
114+
1. **Source.** A C file under this directory implements the ukernel as a
115+
single `__attribute__((always_inline))` function. For `inner_tiled`, the
116+
function takes tile pointers + the `intrinsics_{m,n,k}` unrolling factors
117+
as scalar arguments, and emits the unrolled inner K-loop. The file does
118+
`#include "common.h"` for no-stdlib stdint replacements and pulls in
119+
architecture-specific intrinsic headers (e.g. `<immintrin.h>` for x86).
120+
121+
2. **Build.** A bazel/CMake rule (`iree_bitcode_library` with `ARCH=x86_64`
122+
and feature-specific `COPTS` like `-mavx512bf16`) compiles the C file
123+
to an LLVM bitcode file, e.g.
124+
`iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc`.
125+
126+
3. **Embed.** `iree_c_embed_data` collects all such `.bc` files into a
127+
single TOC blob (`iree_uk_cpu_bitcode.{c,h}`), which is linked into
128+
`iree-compile`.
129+
130+
4. **Register at compile time.** On LLVMCPU target init, the TOC is
131+
iterated and the bitcode files are inserted into the global
132+
`EmbeddedDataDirectory`.
133+
134+
5. **Match.** During LLVMCPU's kernel configuration step
135+
(`LLVMCPUSelectUKernels`), an `iree_codegen.inner_tiled` op carrying a
136+
`#iree_cpu.data_tiled_mma_layout` is matched against the available
137+
ukernels by intrinsic + element types + arch + features. If a match is
138+
found, the op gets a `iree_codegen.ukernel = "<name>"` descriptor
139+
attribute set on it, and the matching bitcode is attached.
140+
141+
6. **Lower.** The generic pass
142+
`iree-codegen-lower-bitcode-ukernels` finds annotated ops, looks up the
143+
`iree_codegen.ukernel_provider` (a
144+
`#iree_cpu.ukernel_provider` attribute installed on the target config),
145+
and calls its `createAndReplaceWithUkernelOp`. That method:
146+
- Looks up the matching `hal.executable_object` (first in the
147+
dispatch's existing `hal.executable.objects`, then in the global
148+
`EmbeddedDataDirectory`).
149+
- Attaches it to the executable variant.
150+
- Replaces the op with an `iree_codegen.ukernel.generic` call to the
151+
ukernel, with `fn_def_attrs = {hal.import.bitcode = true}` so the
152+
eventual call resolves directly against the linked bitcode rather
153+
than through the runtime import table.
154+
155+
7. **Lower again.** The existing
156+
`iree-codegen-lower-ukernel-ops-to-calls` pass lowers
157+
`ukernel.generic` to a `func.call` against the (now-linked) bitcode
158+
function. The LLVM `always_inline` attribute then specializes the
159+
ukernel body into each call site at the LLVM optimization stage.
160+
161+
## How to use as an IREE end user
162+
163+
Pass `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled` to `iree-compile`.
164+
The flag takes a comma-separated list of categories to enable (currently
165+
just `inner_tiled`; the name leaves room for future categories). When a
166+
category is enabled, any matching op implemented by one of the built-in
167+
ukernels listed below will be rewritten to a ukernel call, and the bitcode
168+
will appear in the resulting MLIR as a `hal.executable_object` on the
169+
executable variant.
170+
171+
## How to bring your own ukernel
172+
173+
Attach a `hal.executable_object` to the input MLIR carrying your bitcode
174+
(with a function whose name matches a known ukernel name). When
175+
`iree-compile` walks up looking for executable objects, it finds your one
176+
first and uses it in place of the built-in. Built-in ukernels are
177+
overridden seamlessly; new ukernels (functions IREE has no built-in for)
178+
work the same way as long as the user-supplied IR also annotates the
179+
matched op with `iree_codegen.ukernel = "<name>"`.
180+
181+
## How to author a new built-in ukernel
182+
183+
1. Write the C source under this directory. Function name (and file name)
184+
is the corresponding `MMAIntrinsic` enum value, lowercased, with the
185+
`iree_uk_` prefix — e.g. `MMA_X86_AVX512BF16_1x16x2_F32_BF16` becomes
186+
`iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16`. This mirrors the AMDGPU C
187+
ukernels: those carry the AMDGPU intrinsic name (`mfma_i32_16x16x32_i8`,
188+
etc.) verbatim, and on CPU the arch tag (`x86`, ...) is already part of
189+
the intrinsic name, so this convention drops in cleanly. One ukernel per
190+
intrinsic — the `intrinsics_{m,n,k}` unrolling factors are *function
191+
arguments*, not part of the name (see Scope above for why this fully
192+
specializes anyway). Body is `always_inline`, no allocations, no stdlib
193+
(just `common.h` and intrinsic headers). Loops driven by the unrolling
194+
factors should unroll spontaneously after inlining; if Clang doesn't
195+
manage on its own, add a targeted `#pragma clang loop unroll(...)`.
196+
2. Add a `iree_bitcode_library` entry in `BUILD.bazel` / `CMakeLists.txt`
197+
for the new file, with the appropriate `COPTS` feature flags, and add
198+
the resulting `.bc` to the embedded TOC.
199+
3. If the ukernel implements an `MMAIntrinsic` that the codegen pipeline
200+
does not lower, add a new enum value to `IREECPUEnums.td` and minimal
201+
metadata (`getRowMajorTilesMNKShape`, `getABCElementTypes`) in
202+
`IREECPUAttrs.cpp`*but no `lowerX86…` function*, because the
203+
ukernel is the sole implementation by design.
204+
4. Extend `LLVMCPUSelectUKernels.cpp` to recognize the new ukernel by
205+
intrinsic + element types + arch features.
206+
207+
## How to test
208+
209+
Two complementary levels:
210+
211+
- **Lit tests (compiler IR).** Use the self-contained-IR property: write a
212+
lit test that carries the ukernel bitcode directly as a
213+
`hal.executable_object` literal in the MLIR. `iree-opt` running the
214+
lowering passes can verify the resulting IR without ever invoking the
215+
bitcode-library build. See the lit tests under `test/` for examples —
216+
ranging from single-pass checks (`lower_inner_tiled_to_bitcode_ukernel*`,
217+
`select_ukernel`) up to a full-codegen-pipeline check
218+
(`e2e_inner_tiled_pipeline`) that drives an `inner_tiled` dispatch all the
219+
way to a direct `llvm.call` against the ukernel symbol.
220+
221+
- **End-to-end numerical tests (`tests/e2e/matmul`).** For actually running
222+
a ukernel and checking its results against a reference, add a variant to
223+
the matmul e2e runner tests under `tests/e2e/matmul/` (the
224+
`iree_tests_e2e_matmul_*` family), compiling with
225+
`--iree-llvmcpu-enable-llvm-ukernels=inner_tiled`. This is the level that
226+
catches a wrong inner loop — a lit test only checks that the *call* is
227+
emitted, not that the ukernel *computes* the right thing.
228+
229+
## Seed examples
230+
231+
The directory is intentionally **sparse**. Ukernels are no longer the
232+
majority path under the modernized CPU codegen; codegen handles most
233+
cases on its own. The seeds here illustrate two categories that justify
234+
a ukernel at all:
235+
236+
| Seed | What it shows |
237+
|---|---|
238+
| `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c` | Clean example. Codegen does the same thing at parity; this is the canonical "simplest new-style ukernel" reference for new authors. |
239+
| `iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c` | Practically useful. i8×i8→i32 via VNNI is a workhorse for quantized inference, and codegen has a residual perf gap on this case. |
240+
241+
This README is kept up to date as new seeds and framework pieces land.

0 commit comments

Comments
 (0)