feat: Add npu megatron support #380

Open
UsernameFull wants to merge 4 commits into alibaba:main from UsernameFull:npu_megatron

Conversation

Contributor

@UsernameFull UsernameFull commented Mar 16, 2026

Summary

This PR adds support for Huawei Ascend NPU devices with the Megatron-Core backend, enabling the ROLL framework to run reinforcement learning training on NPU hardware.

Key Changes

1. Platform Detection Priority

File: roll/platforms/__init__.py

Changes: Reordered platform detection to check NPU before CUDA.

Reason: NPU devices were incorrectly falling back to CUDA platform. Prioritizing NPU detection ensures NpuPlatform is properly initialized when torch_npu is available.
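The reordering can be sketched as follows. This is an illustrative assumption, not the actual `roll/platforms/__init__.py` code: the function name `detect_platform` and the probing logic are stand-ins for how the real platform selection behaves.

```python
import importlib.util


def detect_platform() -> str:
    """Return the active platform name, probing NPU before CUDA.

    Hypothetical sketch: checking for torch_npu first prevents NPU
    hosts from falling back to the CUDA platform (the bug this
    change fixes).
    """
    if importlib.util.find_spec("torch_npu") is not None:
        return "npu"  # the real code would initialize NpuPlatform here
    if importlib.util.find_spec("torch") is not None:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    return "cpu"
```

With CUDA checked first, an NPU host that also ships a stub CUDA runtime could be misclassified; probing `torch_npu` first makes the selection unambiguous.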

2. Device-Agnostic Operations

File: roll/pipeline/base_worker.py

Changes:

  • Replaced "cuda" with current_platform.device_type
  • Replaced torch.cuda.memory_allocated() with current_platform.memory_allocated()
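The pattern behind these two replacements can be sketched with a stand-in platform object; the real `current_platform` lives in roll's platform layer, and the field names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Platform:
    # Stand-in for roll's current_platform; attribute names are assumptions.
    device_type: str
    memory_allocated: Callable[[], int]


# On a CUDA host roll would bind torch.cuda.memory_allocated here; on an
# NPU host, the torch_npu equivalent. Stubbed with a constant for the sketch.
current_platform = Platform(device_type="npu", memory_allocated=lambda: 0)


def device_string(rank: int) -> str:
    # Was hard-coded as f"cuda:{rank}"; now derived from the platform,
    # so the same worker code runs on both GPU and NPU hosts.
    return f"{current_platform.device_type}:{rank}"
```

The key point is that worker code never names a backend directly: swapping hardware means swapping the platform object, not editing call sites.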

3. MindSpeed Integration

File: mcore_adapter/src/mcore_adapter/training_args.py

Changes: Added an optional import of mindspeed.megatron_adaptor.

Reason: MindSpeed is Huawei's library of NPU-specific Megatron optimizations. Importing the adaptor patches Megatron-Core for NPU compatibility; wrapping the import in try-except keeps GPU-only environments, where MindSpeed is absent, working unchanged.
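The guarded import likely follows the standard optional-dependency idiom; this is a sketch of the shape, not the exact lines from training_args.py (the `HAS_MINDSPEED` flag is an assumption).

```python
# Guarded import: on GPU-only hosts MindSpeed is not installed, so the
# except branch leaves Megatron-Core unpatched and GPU behavior unchanged.
try:
    import mindspeed.megatron_adaptor  # noqa: F401  # patches Megatron-Core for NPU
    HAS_MINDSPEED = True
except ImportError:
    HAS_MINDSPEED = False
```

Note that `mindspeed.megatron_adaptor` patches Megatron-Core as a side effect of the import itself, which is why a bare import (rather than any explicit call) is sufficient.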

4. NPU Attention Mask Format

File: roll/distributed/strategy/megatron_strategy.py

Changes: Added NPU-specific attention mask transformation to 4D format.

Reason: NPU requires 4D attention masks [B, 1, S, S] instead of the standard 2D [B, S]. This hardware-specific transformation ensures correct attention computation on NPU.

if hasattr(torch, "npu") and torch.npu.is_available() and attention_mask is not None:
    # NPU attention kernels expect a 4D boolean mask [B, 1, S, S]
    attention_mask = attention_mask.bool()
    B, S = attention_mask.shape[0], attention_mask.shape[-1]
    attention_mask = attention_mask[:, None, None, :].expand(B, 1, S, S)

5. Optimizer Compatibility

File: roll/third_party/megatron/optimizer.py

Changes: Added support for the no_weight_decay_cond, scale_lr_cond, and lr_mult parameters.
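These parameters are predicate-style hooks in Megatron-Core's optimizer builder. The sketch below shows typical condition functions; the policies and the commented call shape are assumptions for illustration, not the PR's exact code.

```python
def no_weight_decay_cond(name: str, param) -> bool:
    # Typical policy: exempt biases and LayerNorm weights from weight decay.
    return name.endswith(".bias") or "layernorm" in name.lower()


def scale_lr_cond(name: str, param) -> bool:
    # Parameters matching this predicate train at lr * lr_mult.
    return "embedding" in name.lower()


# In roll's patched optimizer builder these would be forwarded along the
# lines of (hypothetical call, shown for shape only):
#
# optimizer = get_megatron_optimizer(
#     config, model_chunks,
#     no_weight_decay_cond=no_weight_decay_cond,
#     scale_lr_cond=scale_lr_cond,
#     lr_mult=0.1,
# )
```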

6. Example Configurations

Files:

- examples/ascend_examples/qwen3_4B_dpo_megatron.yaml
- examples/ascend_examples/qwen3_8b_rlvr_deepspeed.yaml
- examples/ascend_examples/run_dpo_pipeline.sh

Reason: Provides ready-to-use NPU training examples demonstrating proper device mapping and strategy configuration for both DPO and RLVR pipelines.
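The shape of such an example might resemble the fragment below. Every key and value here is an illustrative assumption, not copied from the PR's YAML files; the actual examples live under examples/ascend_examples/.

```yaml
# Hypothetical DPO-on-NPU fragment, for illustration only.
actor_train:
  strategy_args:
    strategy_name: megatron_train    # Megatron-Core backend
  device_mapping: list(range(0, 8))  # place the worker on NPUs 0-7
```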

Impact

Benefits:

  • Enables Megatron-Core training on Huawei Ascend NPU hardware
  • Maintains full backward compatibility with GPU systems
  • Follows existing platform abstraction patterns

Requirements

  • Huawei Ascend NPU with torch_npu installed
  • MindSpeed (v0.15.3) library for NPU Megatron support

@UsernameFull UsernameFull force-pushed the npu_megatron branch 2 times, most recently from fb2e7dc to acfad89 Compare March 17, 2026 06:54
@UsernameFull UsernameFull changed the title from [WIP] Add npu megatron support to feat: Add npu megatron support Apr 2, 2026
UsernameFull and others added 2 commits April 2, 2026 14:40
# Conflicts:
#	roll/pipeline/sft/sft_pipeline.py
# Conflicts:
#	roll/configs/worker_config.py

feat: fix ascend example

fix: ascend rlvr yaml fix

fix: megatron fix
@UsernameFull UsernameFull force-pushed the npu_megatron branch 2 times, most recently from e6d042f to df7d186 Compare April 2, 2026 07:30