What's Primus-Turbo? | What's New | Primus Product Matrix | Quick Start | Example | Performance | Roadmap | License
Primus-Turbo is a high-performance acceleration library for large-scale model training on AMD GPUs. Built and optimized for the AMD ROCm platform, it covers the full training stack, including core compute operators (GEMM, Attention, GroupedGEMM), communication primitives, optimizer modules, low-precision computation (FP8), and compute-communication overlap kernels.
Guided by three principles (high performance, full feature coverage, and developer friendliness), Primus-Turbo aims to make full use of AMD GPUs for large-scale training workloads and to provide a complete acceleration foundation for next-generation AI systems.
Note: JAX support is under active development. Optim support is planned but not yet available.
- [2025/12/16] 🔥MoE Training Best Practices on AMD GPUs
- [2025/12/01] 🔥Efficient MoE Pre-training at Scale on 1K AMD GPUs with TorchTitan.
- [2025/09/19] Primus-Turbo introduction blog.
- [2025/09/11] Primus-Turbo initial release, version v0.1.0.
| Module | Role | Key Features |
|---|---|---|
| Primus-LM | E2E training framework | - Supports multiple training backends (Megatron, TorchTitan, etc.)<br>- Provides high-performance, scalable distributed training<br>- Deeply integrates with Primus-Turbo and Primus-SaFE |
| Primus-Turbo | High-performance operators & modules | - Supports core training operators and modules (FlashAttention, GEMM, GroupedGEMM, DeepEP, etc.)<br>- Integrates multiple high-performance backends (e.g., CK, hipBLASLt, AITER)<br>- High performance and easy to integrate |
| Primus-SaFE | Stability & platform layer | - Cluster sanity check and benchmarking<br>- Kubernetes scheduling with topology awareness<br>- Fault tolerance<br>- Stability enhancements |
- ROCm >= 7.0
- Python >= 3.10
- PyTorch >= 2.6.0 (with ROCm support)
- AITER (required for some operators, e.g. FlashAttention / FP8):
pip3 install "amd-aiter @ git+https://github.com/ROCm/aiter.git@v0.1.14.post1"
- rocSHMEM (optional; required only for experimental DeepEP). Please refer to our DeepEP Installation Guide for instructions.
| Architecture | Supported GPUs |
|---|---|
| GFX942 | ✅MI300X, ✅MI325X |
| GFX950 | ✅MI350X, ✅MI355X |
See AMD GPU Architecture to find the architecture for your GPU.
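As a quick way to check a machine against the table above, the sketch below normalizes an architecture string and tests it against the two supported targets. It assumes that a ROCm build of PyTorch exposes the architecture through `torch.cuda.get_device_properties(i).gcnArchName`, often with feature suffixes such as `gfx942:sramecc+:xnack-`. The helper names are illustrative and not part of Primus-Turbo.

```python
SUPPORTED_ARCHS = {"gfx942", "gfx950"}  # taken from the support table above


def base_arch(arch_name: str) -> str:
    """Strip feature suffixes, e.g. 'gfx942:sramecc+:xnack-' -> 'gfx942'."""
    return arch_name.split(":", 1)[0].strip().lower()


def is_supported(arch_name: str) -> bool:
    """Return True if the architecture is listed as supported."""
    return base_arch(arch_name) in SUPPORTED_ARCHS


def local_gpu_archs() -> list[str]:
    """Best-effort query of local GPU architectures; empty list if unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    archs = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # Assumption: gcnArchName is present on ROCm builds of PyTorch.
        archs.append(getattr(props, "gcnArchName", "unknown"))
    return archs


if __name__ == "__main__":
    for arch in local_gpu_archs():
        print(arch, "supported" if is_supported(arch) else "not supported")
```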
Use the pre-built AMD ROCm image from Docker Hub:
# PyTorch Ecosystem
docker pull rocm/primus:v26.2
# JAX Ecosystem
docker pull rocm/jax-training:maxtext-v26.2
You can also use the official ROCm PyTorch image from Docker Hub.
Prerequisite: install inside an environment that already has ROCm PyTorch, e.g. the `rocm/primus` image above or the official `rocm/pytorch` image. Primus-Turbo builds against your existing torch and does not install torch for you; in a bare environment `pip` would otherwise pull a non-ROCm torch.
# PyTorch backend (latest)
pip3 install --no-build-isolation "primus-turbo[pytorch]" \
--extra-index-url https://amd-agi.github.io/Primus-Turbo/simple/
# Pin a specific version
pip3 install --no-build-isolation "primus-turbo[pytorch]==0.1.0" \
--extra-index-url https://amd-agi.github.io/Primus-Turbo/simple/
The index currently serves source distributions (sdist), so the install compiles HIP kernels locally (this needs the ROCm toolchain and supports gfx942 / gfx950). Prebuilt wheels are planned. Keep `--no-build-isolation` so the build uses your preinstalled torch.
git clone https://github.com/AMD-AGI/Primus-Turbo.git
cd Primus-Turbo
# Install build/runtime dependencies first
pip3 install -r requirements.txt
# Default backend: PyTorch
pip3 install --no-build-isolation ".[pytorch]"
# JAX backend
PRIMUS_TURBO_FRAMEWORK="JAX" pip3 install --no-build-isolation ".[jax]"
# Install from default branch
pip3 install --no-build-isolation "git+https://github.com/AMD-AGI/Primus-Turbo.git"
# Install from a specific branch
pip3 install --no-build-isolation "git+https://github.com/AMD-AGI/Primus-Turbo.git@main"
Note:
- `".[pytorch]"` / `".[jax]"` means install from the current local repo with the given extras.
- Extras select Python dependencies. The source compilation target is controlled by `PRIMUS_TURBO_FRAMEWORK`.
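To make the split between extras and compile target concrete, the sketch below assembles the pip invocation for a chosen framework, pairing each extra with the matching `PRIMUS_TURBO_FRAMEWORK` value as in the commands above. The helper `install_command` is hypothetical; it only builds the command and does not run it.

```python
import os

# Pairing taken from the commands above: extra name -> PRIMUS_TURBO_FRAMEWORK value.
FRAMEWORKS = {"pytorch": "PYTORCH", "jax": "JAX"}


def install_command(framework: str = "pytorch", editable: bool = False):
    """Return (env, argv) for installing from the current local checkout."""
    key = framework.lower()
    if key not in FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework!r}")
    # Copy the current environment and set the compile target explicitly.
    env = dict(os.environ, PRIMUS_TURBO_FRAMEWORK=FRAMEWORKS[key])
    argv = ["pip3", "install", "--no-build-isolation"]
    if editable:
        argv.append("-e")
    argv.append(f".[{key}]")
    return env, argv


if __name__ == "__main__":
    env, argv = install_command("jax", editable=True)
    print(f'PRIMUS_TURBO_FRAMEWORK="{env["PRIMUS_TURBO_FRAMEWORK"]}"', " ".join(argv))
```

The returned pair can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root if you prefer to drive the install from a script.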
For contributors, use editable mode (-e) so that code changes take effect immediately without reinstalling.
git clone https://github.com/AMD-AGI/Primus-Turbo.git
cd Primus-Turbo
pip3 install -r requirements.txt
pip3 install --no-build-isolation -e ".[pytorch]" -v
# (Optional) Set GPU_ARCHS environment variable to specify target AMD GPU architectures.
GPU_ARCHS="gfx942;gfx950" pip3 install --no-build-isolation -e ".[pytorch]" -v
# (Optional) Set PRIMUS_TURBO_FRAMEWORK to compile for a specific framework.
# Supported values: PYTORCH (default), JAX.
# For example, to compile for JAX:
PRIMUS_TURBO_FRAMEWORK="JAX" pip3 install --no-build-isolation -e ".[jax]" -v
Option 1: Single-process mode (slow but simple)
pytest tests/pytorch/ # run all PyTorch tests
pytest tests/jax/ # run all JAX tests
Option 2: Multi-process mode (faster)
# PyTorch tests
## single-GPU tests (parallel)
pytest tests/pytorch/ -n 8
## deterministic tests (parallel)
pytest tests/pytorch/ -n 8 --deterministic-only
## multi-GPU tests
pytest tests/pytorch/ --dist-only
# JAX tests
## single-GPU tests (parallel)
pytest tests/jax/ -n 8
## multi-GPU tests
pytest tests/jax/ --dist-only
pip installation behavior:
- Use a compatible wheel (`.whl`) if available.
- Fall back to the source distribution (`sdist`, `.tar.gz`) when no wheel matches.
Artifact roles:
- wheel: prebuilt binary package, fast install, no local C++/HIP build.
- sdist: source package, slower install, requires local toolchain, fallback path.
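The two artifact roles can be told apart by file suffix. The sketch below is a small illustration with hypothetical helpers, `artifact_kind` and `list_artifacts`. It only inspects file names and does not reproduce pip's actual resolution logic, which also matches wheel tags against the running interpreter and platform.

```python
from pathlib import Path


def artifact_kind(filename: str) -> str:
    """Classify a distribution file as 'wheel', 'sdist', or 'unknown' by suffix."""
    name = filename.lower()
    if name.endswith(".whl"):
        return "wheel"   # prebuilt binary, no local C++/HIP build
    if name.endswith(".tar.gz"):
        return "sdist"   # source package, compiled locally at install time
    return "unknown"


def list_artifacts(dist_dir: str = "dist") -> dict[str, str]:
    """Map each file in dist_dir to its artifact kind (empty if the dir is missing)."""
    root = Path(dist_dir)
    if not root.is_dir():
        return {}
    return {p.name: artifact_kind(p.name) for p in sorted(root.iterdir()) if p.is_file()}


if __name__ == "__main__":
    for name, kind in list_artifacts().items():
        print(f"{kind:7s} {name}")
```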
# Build wheel (binary distribution)
python3 -m build --wheel --no-isolation
# Build sdist (source distribution)
python3 -m build --sdist --no-isolation
# Install the built wheel
pip3 install --no-build-isolation ./dist/primus_turbo-XXX.whl
# Install the built sdist
pip3 install --no-build-isolation ./dist/primus_turbo-XXX.tar.gz
Tip: Run import checks outside the source tree (for example under `/tmp`) to avoid importing local source files by accident.
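The tip above can be automated with a short script. This sketch uses only the standard library and starts a fresh interpreter in a temporary directory, so that the current working directory cannot shadow the installed package. The helper `import_check` is illustrative, and it does not guard against a `PYTHONPATH` that points back into the source tree.

```python
import subprocess
import sys
import tempfile


def import_check(module: str) -> bool:
    """Try to import `module` in a fresh interpreter started from a temp directory."""
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-c", f"import {module}"],
            cwd=workdir,          # run outside the source tree
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


if __name__ == "__main__":
    ok = import_check("primus_turbo")
    print("import OK" if ok else "import failed")
```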
import torch
import primus_turbo.pytorch as turbo
dtype = torch.bfloat16
device = "cuda:0"
a = torch.randn((128, 256), dtype=dtype, device=device)
b = torch.randn((256, 512), dtype=dtype, device=device)
c = turbo.ops.gemm(a, b)
print(c)
print(c.shape)
See Examples for usage examples.
See Benchmarks for detailed performance results and comparisons.
Roadmap: Primus-Turbo Roadmap H1 2026
Primus-Turbo is licensed under the MIT License.
© 2025 Advanced Micro Devices, Inc. All rights reserved.
