Chinese version: README.zh-CN.md
cuPDLP-C-ROCm is a ROCm/HIP port and validation fork of upstream cuPDLP-C for AMD GPUs/APUs. The project keeps the CPU path and upstream-compatible CUDA path, and adds a ROCm/HIP backend for AMD Radeon-class hardware.
| Item | Current value |
|---|---|
| Primary ROCm target | AMD Radeon 890M |
| ROCm architecture | gfx1150 |
| ROCm version used in local validation | 7.2.1 |
| Planned larger AMD target | AMD Radeon PRO W7900 / gfx1100 |
| CUDA baseline devices | RTX 3090, RTX 4090D, H100 |
Status: experimental but buildable. The ROCm/HIP backend has passed smoke validation and a cross-device Netlib benchmark matrix on AMD Radeon 890M / gfx1150. It is not yet a production-ready, broadly certified, or fully tuned ROCm solver release.
| English | Chinese | Purpose |
|---|---|---|
| docs/ROCM_WORKFLOW.md | docs/ROCM_WORKFLOW.zh-CN.md | Daily build, validation, profiling, and benchmark commands |
| docs/VALIDATION.md | docs/VALIDATION.zh-CN.md | CPU-vs-ROCm validation semantics |
| docs/CROSS_DEVICE_BENCHMARKS.md | docs/CROSS_DEVICE_BENCHMARKS.zh-CN.md | RTX 3090 / RTX 4090D / Radeon 890M benchmark matrix |
| docs/LARGE_MPS_BENCHMARK_PLAN.md | docs/LARGE_MPS_BENCHMARK_PLAN.zh-CN.md | H100-sourced large MPS workflow and cross-device plan |
| docs/ROCM_PORTING_GUIDE.md | docs/ROCM_PORTING_GUIDE.zh-CN.md | CUDA-to-ROCm/HIP migration record |
| docs/TUNING_GUIDE_ROCM.md | docs/TUNING_GUIDE_ROCM.zh-CN.md | ROCm profiling, completed tuning steps, and future targets |
| README_UPSTREAM.md | - | Original upstream README backup |
- CPU-only cuPDLP-C build path.
- Upstream-compatible CUDA build path for NVIDIA baselines.
- ROCm/HIP backend built from migrated CUDA backend code.
- `plc` executable linked against the ROCm/HIP backend.
- CPU-vs-ROCm smoke validation scripts.
- Extended Netlib validation cases.
- Cross-device benchmark workflow and summaries for RTX 3090, RTX 4090D, and Radeon 890M.
- Large MPS benchmark workflow based on H100-hosted cases, inventory files, and SHA256 manifests.
- `rocprofv3` profiling workflow and summary helpers.
| Mode | CMake options | Role |
|---|---|---|
| CPU | BUILD_CUDA=OFF, BUILD_ROCM=OFF | Correctness and portability baseline |
| CUDA | BUILD_CUDA=ON, BUILD_ROCM=OFF | Upstream-compatible NVIDIA backend and benchmark baseline |
| ROCm/HIP | BUILD_CUDA=OFF, BUILD_ROCM=ON | AMD Radeon ROCm/HIP target backend |
BUILD_CUDA and BUILD_ROCM must not be enabled at the same time. Use separate build directories such as build-cpu, build-cuda, and build-rocm-plc.
BUILD_HIP may still appear as a legacy compatibility alias in older notes, but the public ROCm option is BUILD_ROCM=ON.
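The option rules above can be sketched as a small pre-configure check. This is a hypothetical helper, not part of the repository: the names `select_backend` and `_is_on` and the returned mode labels are assumptions for illustration, and the real enforcement is expected to live in the CMake configuration.

```python
def _is_on(value: str) -> bool:
    """Interpret a CMake-style boolean string such as ON/OFF or TRUE/FALSE."""
    return value.strip().upper() in {"ON", "TRUE", "1", "YES", "Y"}


def select_backend(build_cuda: str = "OFF", build_rocm: str = "OFF") -> str:
    """Map the two public options to a build mode; reject enabling both."""
    cuda, rocm = _is_on(build_cuda), _is_on(build_rocm)
    if cuda and rocm:
        raise ValueError("BUILD_CUDA and BUILD_ROCM must not both be ON")
    if cuda:
        return "CUDA"
    if rocm:
        return "ROCm/HIP"
    return "CPU"
```

A wrapper script could call `select_backend` before invoking `cmake` and pick a matching build directory such as build-cpu, build-cuda, or build-rocm-plc.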
Smoke validation currently passes:
| Case | Source | ROCm result |
|---|---|---|
| afiro | example/afiro.mps | PASS |
| sc50b | validation/netlib/sc50b.mps | PASS |
Extended Netlib validation currently reports:
| Case | Result | Notes |
|---|---|---|
| afiro | PASS | Baseline example |
| adlittle | PASS | Relative validation metrics pass |
| blend | PASS | Relative validation metrics pass |
| sc50a | PASS | Relative validation metrics pass |
| sc50b | PASS | Smoke + extended case |
| share2b | INCOMPLETE | Hits current iteration/time limit; not treated as a ROCm port failure |
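The PASS / INCOMPLETE distinction in the table can be illustrated with a relative-difference check between a CPU objective value and a ROCm objective value. This is only a sketch under stated assumptions: the tolerance `1e-4`, the function names, and the three-way rule are hypothetical, and the actual validation semantics are described in docs/VALIDATION.md.

```python
def relative_diff(reference: float, candidate: float) -> float:
    """Relative difference, scaled by max(1, |reference|) to stay stable near zero."""
    return abs(candidate - reference) / max(1.0, abs(reference))


def classify(cpu_obj: float, rocm_obj: float, rocm_converged: bool,
             rel_tol: float = 1e-4) -> str:
    """Return PASS, FAIL, or INCOMPLETE for one validation case."""
    if not rocm_converged:
        # A run that stops at an iteration/time limit is not a port failure.
        return "INCOMPLETE"
    if relative_diff(cpu_obj, rocm_obj) <= rel_tol:
        return "PASS"
    return "FAIL"
```

Keeping INCOMPLETE separate from FAIL lets a limit-bound case such as share2b be reported without counting it against the ROCm port.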
Cross-device Netlib benchmark summary:
| Device | CPU result | GPU/ROCm result | Exception |
|---|---|---|---|
| RTX 3090 / CUDA | 28/28 OPTIMAL after greenbea 200M supplement | 27/28 OPTIMAL | greenbea CUDA reached solver-internal 3600s limit |
| RTX 4090D / CUDA | 28/28 OPTIMAL | 27/28 OPTIMAL | greenbea CUDA hit external 3600s timeout |
| Radeon 890M / ROCm | 28/28 OPTIMAL | 27/28 OPTIMAL | greenbea ROCm hit external 3600s timeout |
Large MPS benchmarking is the current next stage. Raw .mps files stay outside Git. Only inventory files, SHA256 manifests, curated CSV/Markdown summaries, and documentation should be committed.
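A SHA256 manifest for externally stored `.mps` files can be produced with a few lines of standard-library Python. This is a minimal sketch, not the repository's tooling: the function names and the `hash  filename` line layout are assumptions chosen to resemble `sha256sum` output.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large MPS files are never fully loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(case_dir: Path) -> list[str]:
    """Return 'sha256  filename' lines for every .mps file, sorted by path."""
    return [f"{sha256_of(p)}  {p.name}" for p in sorted(case_dir.glob("*.mps"))]
```

Only the resulting manifest lines would be committed; the `.mps` files themselves stay outside Git.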
| Component | Version / value |
|---|---|
| OS | Ubuntu 24.04.x |
| ROCm | 7.2.1 |
| HIP compiler | ROCm Clang 22.0.0 |
| GPU/APU | AMD Radeon 890M |
| GPU architecture | gfx1150 |
| HiGHS | 1.6.0 |
| Build system | CMake + Ninja |
CUDA baselines are run locally on NVIDIA systems such as RTX 3090, RTX 4090D, and H100 using upstream-compatible cuPDLP-C CUDA builds.
Configure and build the ROCm/HIP `plc` target:

    cmake -S . -B build-rocm-plc -G Ninja \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_CUDA=OFF \
      -DBUILD_ROCM=ON \
      -DBUILD_APPS=OFF \
      -DBUILD_PYTHON=OFF \
      -DBUILD_TESTING=ON \
      -DCMAKE_PREFIX_PATH=/opt/rocm \
      -DCMAKE_HIP_ARCHITECTURES=gfx1150
    cmake --build build-rocm-plc --target plc -j"$(nproc)"

Run a smoke example:

    ./build-rocm-plc/bin/plc \
      -fname ./example/afiro.mps \
      -out /tmp/afiro_rocm_sum.json \
      -nIterLim 200

Configure and build the CPU baseline:

    cmake -S . -B build-cpu -G Ninja \
      -DCMAKE_BUILD_TYPE=Release \
      -DBUILD_CUDA=OFF \
      -DBUILD_ROCM=OFF \
      -DBUILD_HIP=OFF \
      -DBUILD_APPS=OFF \
      -DBUILD_PYTHON=OFF
    cmake --build build-cpu --target plc -j"$(nproc)"

Run the ROCm port check and the test suite:

    ./scripts/check_rocm_port.sh
    ctest --test-dir build-rocm-plc --output-on-failure

Extended validation:

    RESULT_ROOT=validation/results/extended_netlib \
      ./scripts/run_validation.sh validation/cases_extended_netlib.txt

Netlib cross-device benchmark uses:

    validation/cases_benchmark_200m.txt
    nIterLim = 200000000
    per-run timeout = 3600s
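The per-run timeout is an external limit applied around each solver invocation, separate from the solver's own iteration limit. A minimal sketch of such a wrapper, assuming nothing about the repository's benchmark scripts (the function name `run_case` and the status labels are hypothetical):

```python
import subprocess
import sys


def run_case(command: list[str], timeout_sec: float = 3600.0) -> str:
    """Run one benchmark command and report OK, ERROR, or TIMEOUT."""
    try:
        completed = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_sec
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return "TIMEOUT"
    return "OK" if completed.returncode == 0 else "ERROR"
```

Recording TIMEOUT as its own status keeps external timeouts, such as the greenbea cases in the summary table, distinguishable from solver errors.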
Radeon 890M run:

    CASE_TIMEOUT_SEC=3600 ./scripts/run_benchmark_890m_full.sh
    ./scripts/summarize_benchmark.py

The large MPS workflow is documented in docs/LARGE_MPS_BENCHMARK_PLAN.md.
Profile the ROCm smoke cases and summarize the results:

    RESULT_ROOT=profiling/results/current ./scripts/profile_rocm_smoke.sh
    python3 scripts/summarize_rocm_profile.py \
      --input profiling/results/current \
      --output profiling/results/current/profile_summary.md

Initial gfx1150 profiling shows that small-case runtime is dominated by many small operations: HIP launch overhead, memory copies, ROCclr copyBuffer dispatches, rocSPARSE SpMV, rocBLAS vector kernels, and custom PDLP update kernels.
To target a different AMD GPU, first list the available agents and their architectures:

    rocminfo | grep -E "Name:|Marketing Name|gfx"
    rocm_agent_enumerator

Then set the proper architecture, for example for AMD Radeon PRO W7900 (depending on ROCm support):

    -DCMAKE_HIP_ARCHITECTURES=gfx1100
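The architecture lookup can also be scripted. The sketch below extracts `gfx` target names from text captured from `rocminfo` and formats them as a CMake flag; the function names are hypothetical, and the regular expression is an assumption about how architecture names appear in that output rather than a documented format.

```python
import re


def find_gfx_targets(rocminfo_text: str) -> list[str]:
    """Return unique gfx architecture names in first-seen order."""
    seen: list[str] = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_text):
        if name not in seen:
            seen.append(name)
    return seen


def cmake_arch_flag(targets: list[str]) -> str:
    """Format targets as a CMAKE_HIP_ARCHITECTURES flag (semicolon-separated list)."""
    if not targets:
        raise ValueError("no gfx architecture found")
    return "-DCMAKE_HIP_ARCHITECTURES=" + ";".join(targets)
```

For example, feeding the captured output of `rocminfo` to `find_gfx_targets` and passing the result to `cmake_arch_flag` yields a flag that can be appended to the configure command; quote it in the shell when it contains more than one target.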
User-visible ROCm output and documentation should use ROCm/HIP terminology. Historical migration notes and README_UPSTREAM.md may keep CUDA terminology. Internal compatibility symbols may remain until the C/HIP boundary is refactored safely.
Do not remove compatibility symbols such as cuda_csr_Ax, cuda_csc_ATy, or cuda_alloc_MVbuffer without updating the C/HIP call boundary and validation scripts.