
[amdgpu] LLVM 20 updates for AMD MI3xx GPUs#8793

Open
tmm77 wants to merge 53 commits into taichi-dev:master from ROCm:amd-integration
Conversation


tmm77 commented Apr 15, 2026

Issue: #

Brief Summary

These code changes update LLVM to version 20 for AMD GPU code generation to enable Taichi on MI300X, MI325X, and MI355X.


Note

High Risk
High risk because it changes LLVM integration across AMDGPU/CUDA/CPU/DX12 backends (pass pipelines, pointer types, intrinsics), which can affect code generation correctness and runtime stability across platforms.

Overview
Updates build and CI tooling to prefer Clang/LLVM 20 (including Linux compiler discovery) and adjusts the build scripts to use system-provided LLVM/CUDA paths rather than always downloading prebuilts.

Modernizes multiple backends for LLVM 16–20 compatibility: switches CPU/CUDA/AMDGPU/DX12 codegen and JIT paths to the New Pass Manager, adapts to removed/renamed LLVM headers/APIs, replaces CUDA nvvm_ldg intrinsics with an address-space load + !invariant.load metadata, and updates various pointer casts toward opaque pointers.

Adds new math ops erf/erfc end-to-end (IR builder, expression ops, LLVM/CUDA codegen, Python API exports), introduces a ROCm multi-stage Dockerfile.rocm plus ReadTheDocs/Sphinx docs for ROCm-Simulation packaging, and tweaks microbenchmarks to support amdgpu and CLI-selected plans.
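As a rough sketch of what CLI-selected benchmark plans could look like (the flag names below are hypothetical and not necessarily those used in the PR):

```python
import argparse

def parse_args(argv):
    """Parse microbenchmark options: backend arch and repeatable plan selection."""
    parser = argparse.ArgumentParser(description="Run Taichi microbenchmarks")
    parser.add_argument("--arch", choices=["cpu", "cuda", "amdgpu"], default="cpu",
                        help="backend to benchmark")
    parser.add_argument("--plan", action="append", default=None,
                        help="benchmark plan(s) to run; may be given multiple times")
    return parser.parse_args(argv)

args = parse_args(["--arch", "amdgpu", "--plan", "fill", "--plan", "saxpy"])
print(args.arch)   # amdgpu
print(args.plan)   # ['fill', 'saxpy']
```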

Reviewed by Cursor Bugbot for commit 440fcc2. Bugbot is set up for automated code reviews on this repo.

tmm77 and others added 30 commits April 29, 2025 17:14
Parameterize microbenchmarks and vulkan sdk update
fix: Patch to avoid the need to fetch source to build Taichi wheel
Taichi Dockerfile
Co-authored-by: Bhavesh Lad <Bhavesh.Lad@amd.com>
Co-authored-by: Tiffany Mintz <tiffany.mintz@amd.com>
…TX handling, and implement new pass manager setup
 from johnnynunez/taichi master branch; some of the changes from these were captured in the previous commit to rocm/taichi
// but to insert passes in the middle, we construct it manually. A simpler way is to
// use `parsePassPipeline`. For now, we build the default pipeline first.
if (config.opt_level > 0) {
MPM = PB.buildPerModuleDefaultPipeline(opt_level);
DX12 intrinsic lowering pass lost on reassignment

High Severity

When config.opt_level > 0, MPM is reassigned via MPM = PB.buildPerModuleDefaultPipeline(opt_level), which completely discards the previously added createTaichiIntrinsicLowerPass. The original code added this pass first, then populated optimization passes on the same manager. Now the intrinsic lowering pass never runs for DX12 when optimizations are enabled.
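The failure mode generalizes: buildPerModuleDefaultPipeline returns a fresh pass manager, so assigning its result over a manager that already holds passes silently drops them. A small Python toy (hypothetical names, not the real LLVM API) makes the ordering concrete:

```python
class PassManager:
    """Toy stand-in for llvm::ModulePassManager: an ordered list of passes."""
    def __init__(self):
        self.passes = []
    def add(self, name):
        self.passes.append(name)

def build_default_pipeline():
    # Like PB.buildPerModuleDefaultPipeline: returns a *new* manager,
    # carrying nothing over from any previously populated one.
    pm = PassManager()
    pm.add("default-opts")
    return pm

# Buggy order: the custom pass is added, then the reassignment discards it.
mpm = PassManager()
mpm.add("taichi-intrinsic-lower")
mpm = build_default_pipeline()
print(mpm.passes)  # ['default-opts']  (the lowering pass was discarded)

# One fix: build the default pipeline first, then add the custom pass to it.
mpm = build_default_pipeline()
mpm.add("taichi-intrinsic-lower")
print(mpm.passes)  # ['default-opts', 'taichi-intrinsic-lower']
```

Note that this variant runs the custom pass last; if the lowering must run before the default optimizations, as the original ordering suggests, the New Pass Manager's extension points (for example PassBuilder::registerPipelineStartEPCallback) can insert it at the front of the built pipeline instead.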


Reviewed by Cursor Bugbot for commit f47d1b8.


machine_gen_gcn->registerPassBuilderCallbacks(module_gen_gcn_pass_manager);

builder.run(*module_clone, MAM);
AMDGPU GCN output empty for LLVM 17+

Medium Severity

In the print_kernel_amdgcn path for LLVM_VERSION_MAJOR >= 17, the code sets up a new pass manager and runs optimization passes on the cloned module, but never calls addPassesToEmitFile to write assembly to llvm_stream_gcn. The gcnstr buffer remains empty, so the written GCN file will contain no content. The legacy path correctly emits assembly via addPassesToEmitFile.
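Put differently, the pass manager only transforms IR in memory; emitting assembly into the output stream is a separate codegen step. A schematic Python toy (not the LLVM API) of why the buffer stays empty:

```python
import io

class Module:
    """Toy in-memory module; stands in for the cloned llvm::Module."""
    def __init__(self):
        self.optimized = False

def run_optimizations(module):
    # Analogous to running the NPM pipeline: mutates IR in memory only.
    module.optimized = True

def emit_assembly(module, stream):
    # Analogous to addPassesToEmitFile: the step the LLVM 17+ path skips.
    stream.write("; amdgcn assembly for module\n")

mod, out = Module(), io.StringIO()
run_optimizations(mod)
print(repr(out.getvalue()))  # ''  (optimizing alone writes nothing)
emit_assembly(mod, out)
```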


if ((u.system, u.machine) not in (("Linux", "arm64"), ("Linux", "aarch64"))) and not (cmake_args.get_effective("TI_WITH_AMDGPU")):
os.environ["LLVM_DIR"] = "/usr/lib/llvm-20/cmake"
os.environ["CUDA_HOME"] = "/usr/local/cuda"
os.environ["CPATH"] = "/usr/local/cuda/include"
LLVM_DIR hardcoded to Linux path for all platforms

Medium Severity

The final LLVM_DIR assignment unconditionally sets it to /usr/lib/llvm-20/cmake for all non-ARM-Linux, non-AMDGPU platforms, including macOS and Windows. The original code used str(out) which pointed to the platform-specific downloaded LLVM path. This overwrites the correct out-based paths for Darwin and Windows, breaking LLVM discovery on those platforms. Similarly, CUDA_HOME and CPATH are set to Linux-specific paths.
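A minimal sketch of one possible fix, assuming the build script keeps the downloaded-toolchain path in a variable like out: gate the apt-style Linux paths on the platform and fall back to the platform-specific prebuilt elsewhere. Function and parameter names here are illustrative, not the script's actual ones:

```python
def configure_toolchain_env(system: str, out_dir: str) -> dict:
    """Return env vars for LLVM/CUDA discovery.

    system: result of platform.system() ("Linux", "Darwin", "Windows", ...)
    out_dir: path of the downloaded prebuilt LLVM (the script's `out`)
    """
    env = {}
    if system == "Linux":
        # System-provided toolchain from the llvm-20 / cuda packages.
        env["LLVM_DIR"] = "/usr/lib/llvm-20/cmake"
        env["CUDA_HOME"] = "/usr/local/cuda"
        env["CPATH"] = "/usr/local/cuda/include"
    else:
        # macOS / Windows: keep pointing at the downloaded prebuilt LLVM.
        env["LLVM_DIR"] = out_dir
    return env

print(configure_toolchain_env("Darwin", "/tmp/taichi-llvm")["LLVM_DIR"])  # /tmp/taichi-llvm
```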


Comment thread: docs/conf.py
f.read())
if not match:
raise ValueError("VERSION not found!")
version_number = match[1]
Docs conf.py searches for nonexistent CMake function

Medium Severity

The docs/conf.py searches for rocm_setup_version(VERSION ...) in CMakeLists.txt, but the project's CMakeLists.txt does not contain this function call. This causes a ValueError("VERSION not found!") to be raised every time the documentation is built, completely breaking the docs build pipeline.
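One possible hardening, sketched under the assumption that the version should come from a standard project(... VERSION x.y.z) call; Taichi's actual CMakeLists.txt may record its version differently, in which case the pattern needs adjusting:

```python
import re

def read_cmake_version(text: str) -> str:
    """Extract a version string from CMake source, with a descriptive error."""
    match = re.search(
        r"project\([^)]*VERSION\s+([0-9]+(?:\.[0-9]+)*)",
        text,
        re.IGNORECASE,
    )
    if not match:
        raise ValueError(
            "VERSION not found in CMakeLists.txt; "
            "expected a project(... VERSION x.y.z) call"
        )
    return match.group(1)

print(read_cmake_version("project(taichi VERSION 1.8.0 LANGUAGES CXX)"))  # 1.8.0
```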


tmm77 changed the title from "LLVM 20 updates for AMD MI3xx GPUs" to "[amdgpu] LLVM 20 updates for AMD MI3xx GPUs" on Apr 16, 2026
This is to address AMD security concerns

@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 5 total unresolved issues (including 4 from previous reviews).



llvm::ModulePassManager builder =
module_pass_manager.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3);

machine->registerPassBuilderCallbacks(module_pass_manager);

AMDGPU target callbacks registered after pipeline is built

High Severity

machine->registerPassBuilderCallbacks() is called after buildPerModuleDefaultPipeline(), so AMDGPU target-specific passes are never included in the optimization pipeline. Both the CPU (codegen_cpu.cpp:311) and CUDA (jit_cuda.cpp:201) implementations correctly call registerPassBuilderCallbacks before building the pipeline. This same ordering mistake occurs twice — in the GCN printing path and the main optimization path.
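The dependency is easy to model: pass-builder callbacks are consulted only at pipeline-build time, so registering them afterwards has no effect on an already-built pipeline. A Python toy (hypothetical names, not the real LLVM API):

```python
class PassBuilder:
    """Toy stand-in for llvm::PassBuilder; callbacks run only at build time."""
    def __init__(self):
        self._callbacks = []

    def register_callbacks(self, cb):
        # Like TargetMachine::registerPassBuilderCallbacks: records a hook.
        self._callbacks.append(cb)

    def build_pipeline(self):
        pipeline = ["default-opts"]
        for cb in self._callbacks:
            cb(pipeline)          # hooks fire here, and only here
        return pipeline

def amdgpu_callbacks(pipeline):
    pipeline.append("amdgpu-target-passes")

# Buggy order: the pipeline is built before the target registers its hooks.
pb = PassBuilder()
pipeline = pb.build_pipeline()
pb.register_callbacks(amdgpu_callbacks)  # too late: pipeline already fixed
print(pipeline)  # ['default-opts']

# Fixed order (as in codegen_cpu.cpp / jit_cuda.cpp): register, then build.
pb = PassBuilder()
pb.register_callbacks(amdgpu_callbacks)
print(pb.build_pipeline())  # ['default-opts', 'amdgpu-target-passes']
```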




6 participants