[NPU] Support qwen3-4B dapo full async mode on Ascend NPU by meiqingsui · Pull Request #36 · redai-infra/Relax

meiqingsui · 2026-05-21T03:23:03Z

Update

Replaced GPU placement group logic with NPU‑aware resource allocation.
Integrated HCCL APIs and adapted communication collectives.
Used MindSpeed to bridge Megatron with Ascend NPU.
Patched Sglang from the official repo to support Ascend NPU backend.
Added Dockerfile.npu with all dependencies (MindSpeed, custom Sglang, etc.) and scripts to run Qwen3‑4B GRPO, including accuracy validation steps.

Test

build env:

docker build -f docker/Dockerfile.npu -t tmp-20260522 .

execute scripts:

bash scripts/training/text/run-qwen3-4B-8xgpu-async-npu.sh

200step results compare with A100:

What

Why

How

Testing

pre-commit run --all-files passes
Tests pass (pytest tests/)
New tests added (if applicable)
Documentation updated (if applicable)

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Performance improvement
CI/CD or build changes

Screenshots / Logs

Co-authored-by: yzr<707972058@qq.com>

# 🐛 Bug Fix ## Fix Ray actor resource specification for NPU accelerators - Add `get_ray_accelerator_kwargs()` utility to abstract accelerator-specific Ray resource kwargs - For GPU: use `num_gpus` kwarg instead of `resources={"GPU": N}` - For NPU: use `resources={"NPU": N}` kwarg (Ray does not recognize `num_npus`) - Update all 5 call sites across `service.py`, `actor_group.py`, and `rollout.py` to use the new utility - Remove obsolete `resources={accel_resource: 1}` patterns that were missed in the initial migration ## Details The previous hardcoded `resources={device_utils.get_ray_accelerator_name(): N}` pattern does not work correctly for GPU actors — Ray expects `num_gpus=N` for GPU scheduling but `resources={"NPU": N}` for NPU. This change ensures correct resource allocation regardless of accelerator type.

- Move try/except mindspeed block to correct alphabetical position - Fix import group ordering: import X before from X import - Remove extraneous blank line within LOCALFOLDER relative imports - Add second blank line after imports (lines-after-imports=2 per pyproject.toml) - Sort relax.utils imports alphabetically: device, logging_utils, misc - Add # noqa to mindspeed.megatron_adaptor import

Update the jq filter in ray-job-npu.sh to check for .resources_total.NPU instead of .resources_total.GPU.

…dcoding - Change init_gloo_group to accept distributed_timeout_minutes parameter with default 30 - Pass args.distributed_timeout_minutes from train_actor instead of hardcoding 60 - Align gloo timeout with the existing distributed_timeout_minutes config used elsewhere

fix(scripts): use NPU resource key for master node selection Created-by: Tgz27 Commit-by: Tgz27 Merged-by: leanfy Description: ## What This PR modifies `scripts/entrypoint/ray-job-npu.sh` in two ways: 1. **Fix master address selection** – Change resource filter from GPU to NPU for NPU-based Ray clusters 2. **Add NPU communication environment variables** – Export HCCL socket port ranges before Ray starts to enable NPU collective communication ## Why ### Part 1 – GPU → NPU filter The script is named `ray-job-npu.sh` and targets NPU environments, but internally still uses GPU as the resource criterion. This causes `MASTER_ADDR` to be empty or incorrectly selected in pure NPU clusters, breaking distributed job startup. ### Part 2 – HCCL environment variables NPU-based distributed training requires HCCL (Huawei Collective Communication Library) to function properly. Without explicit socket port ranges, HCCL may fail to initialize or encounter port conflicts, leading to communication failures between Ray workers. ## How **Change 1 – Resource filter (line 64)** Replace `.resources_total.GPU` with `.resources_total.NPU`: ```bash map(select(.state == "ALIVE" and (.resources_total.GPU // 0) > 0)) | → map(select(.state == "ALIVE" and (.resources_total.NPU // 0) > 0)) | ``` **Change 2 – HCCL environment variables (added before Ray start)** Insert the following exports at an appropriate location (e.g., after existing env setup, before `ray start`): ``` # NPU collective communication configuration export HCCL_NPU_SOCKET_PORT_RANGE="62000-62050" export HCCL_HOST_SOCKET_PORT_RANGE="62100-62200" ``` These port ranges are configurable and should not conflict with other services. See merge request: hw-pbclouds/Relax!16

refactor(distributed): parameterize gloo group timeout instead of hardcoding Created-by: hZhang111 Commit-by: hZhang111 Merged-by: gcw_QC397XOa Description: ## What - Add `distributed_timeout_minutes` parameter to `init_gloo_group()` in `relax/utils/distributed_utils.py` - Pass `args.distributed_timeout_minutes` from `train_actor.py` instead of hardcoding the timeout value ## Why The Gloo communication group timeout was hardcoded to 60 minutes, which is inconsistent with the rest of the codebase where `args.distributed_timeout_minutes` is already used as the single source of truth for distributed timeouts. Hardcoding makes it impossible to tune the timeout for different cluster sizes without modifying source code. ## How Parameterize the Gloo group timeout through the existing `distributed_timeout_minutes` config: | Before | After | |--------|-------| | `init_gloo_group()` with hardcoded `timeout=timedelta(minutes=60)` | `init_gloo_group(distributed_timeout_minutes=30)` with `timeout=timedelta(minutes=distributed_timeout_minutes)` | | No argument passed from caller | `init_gloo_group(distributed_timeout_minutes=args.distributed_timeout_minutes)` | Updated call site: | File | Change | |------|--------| | `relax/utils/distributed_utils.py` | Add `distributed_timeout_minutes: int = 30` parameter to `init_gloo_group`, use it for `dist.new_group` timeout | | `relax/distributed/ray/train_actor.py` | Pass `distributed_timeout_minutes=args.distributed_timeout_minutes` to `init_gloo_group` | ## Testing - [ ] `pre-commit run --all-files` passes - [ ] Tests pass (`pytest tests/`) - [ ] New tests added (if applicable) - [ ] Documentation updated (if applicable) ## Type of Change - [ ] Bug fix (non-breaking change that fixes an issue) - [ ] New feature (non-breaking change that adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [x] Refactoring (no functional changes) - [ ] Performance improvement - [ ] CI/CD or build changes ## Screenshots / Logs N/A — no user-facing changes. See merge request: hw-pbclouds/Relax!21

fix(ray): abstract accelerator resource kwargs for NPU compatibility Created-by: gcw_QC397XOa Commit-by: huangxudong Merged-by: gcw_QC397XOa Description: ## What - Add `get_ray_accelerator_kwargs()` utility function in `relax/utils/utils.py` - Replace all hardcoded `resources={accelerator_name: N}` patterns with the new utility across 5 Ray actor call sites ## Why Ray expects the dedicated `num_gpus=N` kwarg for GPU scheduling, not `resources={"GPU": N}`. GPU actors with the wrong resource spec may fail to schedule or be silently ignored by Ray. ## How Add a single abstraction point that returns the correct Ray kwargs per accelerator type: Accelerator | Return Value | |-------------|-------------| | GPU | `{"num_gpus": N}` | | NPU | `{"resources": {"NPU": N}}` | Updated all 5 call sites across 3 files: | File | Location | |------|----------| | `relax/core/service.py` | `create_placement_group` — InfoActor | | `relax/distributed/ray/actor_group.py` | `RayTrainGroup` — TrainRayActor | | `relax/distributed/ray/rollout.py` | `EngineGroup` — RolloutRayActor (SGLang) | | `relax/distributed/ray/rollout.py` | `RolloutManager._scale_out_replica` — InfoActor | | `relax/distributed/ray/rollout.py` | `RolloutManager._create_external_engine` — proxy actor | ## Testing  - [ ] `pre-commit run --all-files` passes - [ ] Tests pass (`pytest tests/`) - [ ] New tests added (if applicable) - [ ] Documentation updated (if applicable) ## Type of Change - [x] Bug fix (non-breaking change that fixes an issue) - [ ] New feature (non-breaking change that adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [ ] Refactoring (no functional changes) - [ ] Performance improvement - [ ] CI/CD or build changes ## Screenshots / Logs N/A — no user-facing changes. See merge request: hw-pbclouds/Relax!17

Yangruipis · 2026-06-01T11:13:51Z

LGTM! Many thanks to the Huawei team!

style: apply pre-commit formatting fixes - Fix import ordering (ruff/isort) - Add trailing commas for multi-line kwargs (ruff format) - Remove unused import (device_utils in actor_group.py) - Fix blank line spacing per PEP8 - Fix end-of-file newline in Dockerfile.npu Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

style: apply pre-commit formatting fixes Created-by: gcw_QC397XOa Commit-by: huangxudong Merged-by: hZhang111 Description: ## What Apply pre-commit auto-fixes to 8 files to meet code formatting standards. All changes are cosmetic — no logic or behavior is altered. **8 files changed, +13/-13** ### Changes by hook | Hook | Files | Change | |------|-------|--------| | **ruff format** | `service.py`, `actor_group.py`, `rollout.py` | Add trailing commas after `**accelerator_kwargs` in multi-line function calls | | **ruff / isort** | `rollout.py`, `utils.py` | Reorder imports: `from relax.utils...` moves to correct alphabetical position | | **ruff / isort** | `actor_group.py` | Remove unused import `from relax.utils import device as device_utils` | | **ruff / pycodestyle** | `__init__.py`, `actor.py`, `device.py`, `utils.py` | Add blank lines between import groups and function definitions per PEP8 (E302) | | **end-of-file-fixer** | `Dockerfile.npu` | Remove trailing blank line at end of file | ## Why `pre-commit run --all-files` flagged these formatting issues. Keeping the codebase clean ensures CI passes and reduces future diff noise. ## Testing - [x] `pre-commit run --all-files` — all 14 hooks pass - [x] No functional changes — `git diff --ignore-all-space` produces empty output ## Type of Change - [x] Refactoring (no functional changes — style only) See merge request: hw-pbclouds/Relax!25

fix(docker): pin sgl-kernel-npu to tagged release 2026.04.15.rc4 style(docker): remove trailing blank line at EOF

fix(docker): pin sgl-kernel-npu to tagged release 2026.04.15.rc4 Created-by: hZhang111 Commit-by: hZhang111 Merged-by: gcw_QC397XOa Description: ## What - Pin `sgl-kernel-npu` to tagged release `2026.04.15.rc4` in `docker/Dockerfile.npu` instead of building from the default `main` branch ## Why Building from the `main` branch of `sgl-kernel-npu` can pull in unstable or incompatible changes. Pinning to a specific tagged release ensures reproducible and deterministic Docker image builds for NPU environments. ## How Add `git checkout 2026.04.15.rc4` after cloning `sgl-kernel-npu` and before running `bash build.sh`, so the build always uses the pinned version. See merge request: hw-pbclouds/Relax!24

docs(npu): add NPU training guide for Ascend 910C # 📝 Documentation ## Add NPU training guide with Docker setup and launch instructions - Document Ascend 910C hardware requirements and driver/firmware versions - Add model support table with Qwen3-4B GRPO async training config - Include environment setup steps (Docker build, NPU driver check) - Document container launch command with device and volume mappings - Add training launch scripts and parameter explanations (PERF_ARGS, OPTIMIZER_ARGS, SGLANG_ARGS, MISC_ARGS)

docs(npu): add NPU training guide for Ascend 910C Created-by: dabuliu123 Commit-by: root Merged-by: hZhang111 Description: ## What  ## Why  ## How  ## Testing  - [ ] `pre-commit run --all-files` passes - [ ] Tests pass (`pytest tests/`) - [ ] New tests added (if applicable) - [ ] Documentation updated (if applicable) ## Type of Change - [ ] Bug fix (non-breaking change that fixes an issue) - [ ] New feature (non-breaking change that adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [ ] Refactoring (no functional changes) - [ ] Performance improvement - [ ] CI/CD or build changes ## Screenshots / Logs  See merge request: hw-pbclouds/Relax!26

docs(npu): add NPU training guide for Ascend 910C # 📝 Documentation ## Add NPU training guide with Docker setup and launch instructions - Document Ascend 910C hardware requirements and driver/firmware versions - Add model support table with Qwen3-4B GRPO async training config - Include environment setup steps (Docker build, NPU driver check) - Document container launch command with device and volume mappings - Add training launch scripts and parameter explanations (PERF_ARGS, OPTIMIZER_ARGS, SGLANG_ARGS, MISC_ARGS) docs(npu): update training scenario to DAPO and improve formatting # 📝 Documentation ## Update NPU training guide - Change training scenario from GRPO to DAPO for Qwen3-4B - Add blank lines between parameter description items for better readability

docs(npu): update training scenario to DAPO and improve formatting Created-by: dabuliu123 Commit-by: dabuliu123;root Merged-by: dabuliu123 Description: ## What  ## Why  ## How  ## Testing  - [ ] `pre-commit run --all-files` passes - [ ] Tests pass (`pytest tests/`) - [ ] New tests added (if applicable) - [ ] Documentation updated (if applicable) ## Type of Change - [ ] Bug fix (non-breaking change that fixes an issue) - [ ] New feature (non-breaking change that adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) - [ ] Documentation update - [ ] Refactoring (no functional changes) - [ ] Performance improvement - [ ] CI/CD or build changes ## Screenshots / Logs  See merge request: hw-pbclouds/Relax!27

NINGBENZHE

LGTM

meiqingsui requested review from Aurelius84, NINGBENZHE and Yangruipis as code owners May 21, 2026 03:23

meiqingsui force-pushed the ascend-dev-0514 branch 5 times, most recently from e558a75 to f51fb3d Compare May 25, 2026 03:28

feat(npu): support npu

2143f14

Co-authored-by: yzr<707972058@qq.com>

meiqingsui force-pushed the ascend-dev-0514 branch from f51fb3d to 2143f14 Compare May 25, 2026 05:55

hbamboo and others added 7 commits May 28, 2026 10:00

fix(scripts): use NPU resource key for master node selection

b2b6cdc

Update the jq filter in ray-job-npu.sh to check for .resources_total.NPU instead of .resources_total.GPU.

Yangruipis previously approved these changes Jun 1, 2026

View reviewed changes

meiqingsui dismissed Yangruipis’s stale review via 56be0a2 June 1, 2026 15:00

meiqingsui force-pushed the ascend-dev-0514 branch from 08cfef6 to 56be0a2 Compare June 1, 2026 15:00

hZhang111 and others added 8 commits June 3, 2026 15:01

fix(docker): pin sgl-kernel-npu to tagged release 2026.04.15.rc4

d4df857

fix(docker): pin sgl-kernel-npu to tagged release 2026.04.15.rc4 style(docker): remove trailing blank line at EOF

meiqingsui changed the title ~~[NPU] Support qwen3-4B grpo full async mode on Ascend NPU~~ [NPU] Support qwen3-4B dapo full async mode on Ascend NPU Jun 3, 2026

Merge branch 'main' into ascend-dev-0514

e38d4fa

NINGBENZHE approved these changes Jun 3, 2026

View reviewed changes

NINGBENZHE merged commit 896e183 into redai-infra:main Jun 3, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NPU] Support qwen3-4B dapo full async mode on Ascend NPU#36

[NPU] Support qwen3-4B dapo full async mode on Ascend NPU#36
NINGBENZHE merged 17 commits into
redai-infra:mainfrom
meiqingsui:ascend-dev-0514

meiqingsui commented May 21, 2026 •

edited

Loading

Uh oh!

Yangruipis commented Jun 1, 2026

Uh oh!

NINGBENZHE left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

meiqingsui commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Update

Test

What

Why

How

Testing

Type of Change

Screenshots / Logs

Uh oh!

Yangruipis commented Jun 1, 2026

Uh oh!

NINGBENZHE left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

meiqingsui commented May 21, 2026 •

edited

Loading