|
| 1 | +# LLVMCPU C microkernels |
| 2 | + |
| 3 | +This directory holds **C-based microkernels** (ukernels) for the LLVMCPU |
| 4 | +target backend. They are compiled to LLVM bitcode, embedded as static data |
| 5 | +in `iree-compile`, and at compile time are copied into the IR as |
| 6 | +`hal.executable_object` attributes on the dispatch's HAL executable variant. |
| 7 | +That representation is the same as the GPU C ukernels under |
| 8 | +`compiler/plugins/target/ROCM/builtins/ukernel/`. |
| 9 | + |
| 10 | +This is **not** the same framework as the legacy CPU ukernels under |
| 11 | +`runtime/src/iree/builtins/ukernel/` — see the design pillars below for the |
| 12 | +differences. The two coexist; this directory is purely additive. |
| 13 | + |
| 14 | +## Scope |
| 15 | + |
| 16 | +The framework itself is **not tied to any one op**. A ukernel is just a C |
| 17 | +function, compiled to bitcode, that codegen may substitute for some piece |
| 18 | +of IR it would otherwise lower itself. Any op that codegen knows how to |
| 19 | +match and replace with a `iree_codegen.ukernel.generic` call can have a |
| 20 | +ukernel; the build/embed/lookup/lowering machinery below is generic. |
| 21 | + |
| 22 | +The **initial and primary** use case — and the only one wired up so far — |
| 23 | +is `iree_codegen.inner_tiled`. The rest of this document leans on that case |
| 24 | +for its concrete examples, but keep the distinction in mind: the pillars |
| 25 | +about *where ukernels live* and *how their bitcode reaches the IR* are |
| 26 | +general, while the specifics of *what a ukernel computes* (the inner K |
| 27 | +loop, the `intrinsics_{m,n,k}` contract) are particular to `inner_tiled`. |
| 28 | + |
| 29 | +### The `inner_tiled` case |
| 30 | + |
| 31 | +For an `iree_codegen.inner_tiled` op with a |
| 32 | +`#iree_cpu.data_tiled_mma_layout`, a ukernel implements the **inner K |
| 33 | +loop** of one data-tiled MMA tile. Outer M/N looping is handled by ordinary |
| 34 | +IREE tiling *before* the ukernel runs; the ukernel sees a single |
| 35 | +(M, N) tile and walks K itself. |
| 36 | + |
| 37 | +A ukernel is specific to **one `MMAIntrinsic`** but **generic over the |
| 38 | +`intrinsics_{m,n,k}` unrolling factors**. Those factors are passed to the C |
| 39 | +function as ordinary parameters and look like runtime values inside the |
| 40 | +ukernel's translation unit — but they are *always* compile-time constants |
| 41 | +in the calling context (the caller passes the `DataTiledMMAAttr`'s |
| 42 | +constants), and the ukernel is *always* inlined into that caller (anything |
| 43 | +else is a bug). Linking the ukernel as bitcode into the dispatch lets the |
| 44 | +post-inlining optimizer fully specialize the body to each call site's |
| 45 | +concrete `(intrinsics_m, intrinsics_n, intrinsics_k)` triple: loops over |
| 46 | +those counts unroll, the unrolled tile lives in registers, and nothing of |
| 47 | +the "runtime parameter" framing survives. In effect the bitcode + inlining |
| 48 | ++ LTO chain gives C++-template-like specialization without the template |
| 49 | +syntax — specialization on the `MMAIntrinsic` happens via *symbol |
| 50 | +selection* (one ukernel function per intrinsic), and specialization on the |
| 51 | +unrolling factors happens via *post-inlining loop unrolling*. |
| 52 | + |
| 53 | +In particular, this is a **lower-level interface than the legacy |
| 54 | +`mmt4d` ukernels**: |
| 55 | + |
| 56 | +| | Legacy (`runtime/.../ukernel/`) | New (this directory) | |
| 57 | +|---|---|---| |
| 58 | +| What an entry point is | A whole-matmul library: handles arbitrary `mmt4d` shapes, walks outer M/N itself, internally dispatches to arch-variant inner kernels | The inner K-loop of one specific (intrinsic, arch-variant) configuration | |
| 59 | +| Outer M/N looping | Inside the ukernel | Outside, done by IREE tiling | |
| 60 | +| Per-intrinsic specialization | Hidden behind a dispatching front door | Exposed — one entry point per (intrinsic, arch-variant) | |
| 61 | +| Caller responsibility | Almost nothing | Tile to the ukernel's exact data-tiled shape | |
| 62 | + |
| 63 | +## Design pillars |
| 64 | + |
| 65 | +### 1. Lives in `compiler/`, built only as LLVM bitcode |
| 66 | + |
| 67 | +Two structural decisions, both intentionally different from the legacy |
| 68 | +framework: |
| 69 | + |
| 70 | +- **`compiler/`, not `runtime/`.** The legacy ukernels live under |
| 71 | + `runtime/src/iree/builtins/ukernel/` because at the time they served two |
| 72 | + consumers: the llvm-cpu backend *and* VMVX. The VMVX ukernel path is now |
| 73 | + effectively defunct, so the runtime-side / shared-with-VMVX |
| 74 | + justification no longer applies. The new ukernels are llvm-cpu-only and |
| 75 | + live in the plugin that uses them, mirroring the GPU C ukernels under |
| 76 | + `compiler/plugins/target/ROCM/builtins/ukernel/`. |
| 77 | + |
| 78 | +- **Bitcode only.** The legacy ukernels are also built with the native |
| 79 | + toolchain, to be testable/benchmarkable in isolation via the runtime's |
| 80 | + `tools/` directory. That property turned out not to be worth the extra |
| 81 | + build complexity. The new ukernels are built only as bitcode, embedded |
| 82 | + as static data in `iree-compile`, and tested via the self-contained-IR |
| 83 | + property described next. |
| 84 | + |
| 85 | +### 2. IR representation: self-contained via `hal.executable_object` |
| 86 | + |
| 87 | +When `iree-compile` decides to use one of these built-in ukernels, the |
| 88 | +ukernel's bitcode is **copied into the IR** as a `hal.executable_object` |
| 89 | +attached to the dispatch's executable variant. From that point on the |
| 90 | +module is self-contained: everything needed to compile and run the |
| 91 | +dispatch is in the IR. |
| 92 | + |
| 93 | +This is **different** from the legacy framework, where the ukernel was an |
| 94 | +opaque external symbol the runtime was expected to provide. Two |
| 95 | +consequences fall out: |
| 96 | + |
| 97 | +- **Tests are trivially writable.** A lit test is just an MLIR file that |
| 98 | + carries its own `hal.executable_object`. The test owns its ukernel |
| 99 | + bitecode bytes outright — no runtime-side fixturing, no special CI |
| 100 | + wiring, no need to actually rerun the bitcode-library build for the test |
| 101 | + to run. |
| 102 | +- **Bring-your-own-ukernel is just `hal.executable_object`.** A user who |
| 103 | + wants to override or supply their own ukernel attaches their own |
| 104 | + `hal.executable_object` to their input MLIR. `iree-compile` honors it on |
| 105 | + equal footing with the built-in ones (and the lookup logic, when it |
| 106 | + walks up looking for executable objects, finds the user's one first). |
| 107 | + No fork of IREE, no runtime patching. |
| 108 | + |
| 109 | +## How a built-in ukernel works end to end |
| 110 | + |
| 111 | +This walks through the `inner_tiled` path; an analogous flow would apply to |
| 112 | +any future op the framework grows to cover. |
| 113 | + |
| 114 | +1. **Source.** A C file under this directory implements the ukernel as a |
| 115 | + single `__attribute__((always_inline))` function. For `inner_tiled`, the |
| 116 | + function takes tile pointers + the `intrinsics_{m,n,k}` unrolling factors |
| 117 | + as scalar arguments, and emits the unrolled inner K-loop. The file does |
| 118 | + `#include "common.h"` for no-stdlib stdint replacements and pulls in |
| 119 | + architecture-specific intrinsic headers (e.g. `<immintrin.h>` for x86). |
| 120 | + |
| 121 | +2. **Build.** A bazel/CMake rule (`iree_bitcode_library` with `ARCH=x86_64` |
| 122 | + and feature-specific `COPTS` like `-mavx512bf16`) compiles the C file |
| 123 | + to an LLVM bitcode file, e.g. |
| 124 | + `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.x86_64_avx512bf16.bc`. |
| 125 | + |
| 126 | +3. **Embed.** `iree_c_embed_data` collects all such `.bc` files into a |
| 127 | + single TOC blob (`iree_uk_cpu_bitcode.{c,h}`), which is linked into |
| 128 | + `iree-compile`. |
| 129 | + |
| 130 | +4. **Register at compile time.** On LLVMCPU target init, the TOC is |
| 131 | + iterated and the bitcode files are inserted into the global |
| 132 | + `EmbeddedDataDirectory`. |
| 133 | + |
| 134 | +5. **Match.** During LLVMCPU's kernel configuration step |
| 135 | + (`LLVMCPUSelectUKernels`), an `iree_codegen.inner_tiled` op carrying a |
| 136 | + `#iree_cpu.data_tiled_mma_layout` is matched against the available |
| 137 | + ukernels by intrinsic + element types + arch + features. If a match is |
| 138 | + found, the op gets a `iree_codegen.ukernel = "<name>"` descriptor |
| 139 | + attribute set on it, and the matching bitcode is attached. |
| 140 | + |
| 141 | +6. **Lower.** The generic pass |
| 142 | + `iree-codegen-lower-bitcode-ukernels` finds annotated ops, looks up the |
| 143 | + `iree_codegen.ukernel_provider` (a |
| 144 | + `#iree_cpu.ukernel_provider` attribute installed on the target config), |
| 145 | + and calls its `createAndReplaceWithUkernelOp`. That method: |
| 146 | + - Looks up the matching `hal.executable_object` (first in the |
| 147 | + dispatch's existing `hal.executable.objects`, then in the global |
| 148 | + `EmbeddedDataDirectory`). |
| 149 | + - Attaches it to the executable variant. |
| 150 | + - Replaces the op with an `iree_codegen.ukernel.generic` call to the |
| 151 | + ukernel, with `fn_def_attrs = {hal.import.bitcode = true}` so the |
| 152 | + eventual call resolves directly against the linked bitcode rather |
| 153 | + than through the runtime import table. |
| 154 | + |
| 155 | +7. **Lower again.** The existing |
| 156 | + `iree-codegen-lower-ukernel-ops-to-calls` pass lowers |
| 157 | + `ukernel.generic` to a `func.call` against the (now-linked) bitcode |
| 158 | + function. The LLVM `always_inline` attribute then specializes the |
| 159 | + ukernel body into each call site at the LLVM optimization stage. |
| 160 | + |
| 161 | +## How to use as an IREE end user |
| 162 | + |
| 163 | +Pass `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled` to `iree-compile`. |
| 164 | +The flag takes a comma-separated list of categories to enable (currently |
| 165 | +just `inner_tiled`; the name leaves room for future categories). When a |
| 166 | +category is enabled, any matching op implemented by one of the built-in |
| 167 | +ukernels listed below will be rewritten to a ukernel call, and the bitcode |
| 168 | +will appear in the resulting MLIR as a `hal.executable_object` on the |
| 169 | +executable variant. |
| 170 | + |
| 171 | +## How to bring your own ukernel |
| 172 | + |
| 173 | +Attach a `hal.executable_object` to the input MLIR carrying your bitcode |
| 174 | +(with a function whose name matches a known ukernel name). When |
| 175 | +`iree-compile` walks up looking for executable objects, it finds your one |
| 176 | +first and uses it in place of the built-in. Built-in ukernels are |
| 177 | +overridden seamlessly; new ukernels (functions IREE has no built-in for) |
| 178 | +work the same way as long as the user-supplied IR also annotates the |
| 179 | +matched op with `iree_codegen.ukernel = "<name>"`. |
| 180 | + |
| 181 | +## How to author a new built-in ukernel |
| 182 | + |
| 183 | +1. Write the C source under this directory. Function name (and file name) |
| 184 | + is the corresponding `MMAIntrinsic` enum value, lowercased, with the |
| 185 | + `iree_uk_` prefix — e.g. `MMA_X86_AVX512BF16_1x16x2_F32_BF16` becomes |
| 186 | + `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16`. This mirrors the AMDGPU C |
| 187 | + ukernels: those carry the AMDGPU intrinsic name (`mfma_i32_16x16x32_i8`, |
| 188 | + etc.) verbatim, and on CPU the arch tag (`x86`, ...) is already part of |
| 189 | + the intrinsic name, so this convention drops in cleanly. One ukernel per |
| 190 | + intrinsic — the `intrinsics_{m,n,k}` unrolling factors are *function |
| 191 | + arguments*, not part of the name (see Scope above for why this fully |
| 192 | + specializes anyway). Body is `always_inline`, no allocations, no stdlib |
| 193 | + (just `common.h` and intrinsic headers). Loops driven by the unrolling |
| 194 | + factors should unroll spontaneously after inlining; if Clang doesn't |
| 195 | + manage on its own, add a targeted `#pragma clang loop unroll(...)`. |
| 196 | +2. Add a `iree_bitcode_library` entry in `BUILD.bazel` / `CMakeLists.txt` |
| 197 | + for the new file, with the appropriate `COPTS` feature flags, and add |
| 198 | + the resulting `.bc` to the embedded TOC. |
| 199 | +3. If the ukernel implements an `MMAIntrinsic` that the codegen pipeline |
| 200 | + does not lower, add a new enum value to `IREECPUEnums.td` and minimal |
| 201 | + metadata (`getRowMajorTilesMNKShape`, `getABCElementTypes`) in |
| 202 | + `IREECPUAttrs.cpp` — *but no `lowerX86…` function*, because the |
| 203 | + ukernel is the sole implementation by design. |
| 204 | +4. Extend `LLVMCPUSelectUKernels.cpp` to recognize the new ukernel by |
| 205 | + intrinsic + element types + arch features. |
| 206 | + |
| 207 | +## How to test |
| 208 | + |
| 209 | +Two complementary levels: |
| 210 | + |
| 211 | +- **Lit tests (compiler IR).** Use the self-contained-IR property: write a |
| 212 | + lit test that carries the ukernel bitcode directly as a |
| 213 | + `hal.executable_object` literal in the MLIR. `iree-opt` running the |
| 214 | + lowering passes can verify the resulting IR without ever invoking the |
| 215 | + bitcode-library build. See the lit tests under `test/` for examples — |
| 216 | + ranging from single-pass checks (`lower_inner_tiled_to_bitcode_ukernel*`, |
| 217 | + `select_ukernel`) up to a full-codegen-pipeline check |
| 218 | + (`e2e_inner_tiled_pipeline`) that drives an `inner_tiled` dispatch all the |
| 219 | + way to a direct `llvm.call` against the ukernel symbol. |
| 220 | + |
| 221 | +- **End-to-end numerical tests (`tests/e2e/matmul`).** For actually running |
| 222 | + a ukernel and checking its results against a reference, add a variant to |
| 223 | + the matmul e2e runner tests under `tests/e2e/matmul/` (the |
| 224 | + `iree_tests_e2e_matmul_*` family), compiling with |
| 225 | + `--iree-llvmcpu-enable-llvm-ukernels=inner_tiled`. This is the level that |
| 226 | + catches a wrong inner loop — a lit test only checks that the *call* is |
| 227 | + emitted, not that the ukernel *computes* the right thing. |
| 228 | + |
| 229 | +## Seed examples |
| 230 | + |
| 231 | +The directory is intentionally **sparse**. Ukernels are no longer the |
| 232 | +majority path under the modernized CPU codegen; codegen handles most |
| 233 | +cases on its own. The seeds here illustrate two categories that justify |
| 234 | +a ukernel at all: |
| 235 | + |
| 236 | +| Seed | What it shows | |
| 237 | +|---|---| |
| 238 | +| `iree_uk_mma_x86_avx512bf16_1x16x2_f32_bf16.c` | Clean example. Codegen does the same thing at parity; this is the canonical "simplest new-style ukernel" reference for new authors. | |
| 239 | +| `iree_uk_mma_x86_avx512vnni_16x16x2_i32_i8_casti16.c` | Practically useful. i8×i8→i32 via VNNI is a workhorse for quantized inference, and codegen has a residual perf gap on this case. | |
| 240 | + |
| 241 | +This README is kept up to date as new seeds and framework pieces land. |
0 commit comments