[hardware] feat: add Intel XPU device support#695

Draft
kahlun wants to merge 4 commits into ByteDance-Seed:main from kahlun:xpu/device-support

Conversation

@kahlun (Contributor) commented on Apr 27, 2026

Summary

Add Intel XPU (GPU) device detection and backend support to VeOmni, enabling training on Intel Arc/Data Center GPUs.

Changes (4 files, +13 / -3 lines)

| File | Change | Bug ID |
| --- | --- | --- |
| veomni/utils/device.py | Add IS_XPU_AVAILABLE flag; add an XPU branch to get_device_type(), get_dist_comm_backend() (xccl), and stream_synchronize() (sketched below) | B1, B2 |
| veomni/ops/kernels/moe/_kernels/utils/device.py | Guard torch.cuda.get_device_capability() for non-CUDA devices | B3 |
| veomni/distributed/moe/moe_layer.py | Guard group_gemm import (Triton CUDA kernels) so it only runs on CUDA | B3 |
| veomni/distributed/torch_parallelize.py | Accept "xpu" as a valid init_device in the non-FSDP path | B5 |
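
For concreteness, here is a minimal sketch of what the device.py additions could look like. The names (IS_XPU_AVAILABLE, get_device_type, get_dist_comm_backend, stream_synchronize) come from the table above, but the bodies are simplified assumptions, and it presumes a PyTorch build that exposes the torch.xpu namespace (2.4+):

```python
# Hedged sketch of the device.py additions; VeOmni's real helpers differ
# in detail. Assumes PyTorch >= 2.4, which exposes the torch.xpu namespace.
import torch

IS_XPU_AVAILABLE = hasattr(torch, "xpu") and torch.xpu.is_available()

def get_device_type() -> str:
    """Accelerator type string used for torch.device / init_device."""
    if torch.cuda.is_available():
        return "cuda"
    if IS_XPU_AVAILABLE:
        return "xpu"
    return "cpu"

def get_dist_comm_backend() -> str:
    """Map device type to a torch.distributed backend; XPU uses oneCCL ("xccl")."""
    return {"cuda": "nccl", "xpu": "xccl", "cpu": "gloo"}[get_device_type()]

def stream_synchronize() -> None:
    """Block until queued kernels on the current device's stream complete."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elif IS_XPU_AVAILABLE:
        torch.xpu.synchronize()
```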

Testing

Tested on Intel Arc Pro B70 (Battlemage BMG-G31, 32 GB VRAM):

  • 1-GPU: 7/7 pass (import, model build, FSDP2 parallelize, optimizer, fwd+bwd, 3-step train loop, CPU offload round-trip)
  • 2-GPU: 8/8 pass (with CCL_ATL_SHM=1)
  • Model: Qwen2.5-0.5B (494M params, bf16)

Context

Follow-up to #648 (the src_data_rank=None fix for a scatter hang). This PR closes the remaining XPU device-detection gaps that block VeOmni from running on Intel GPUs.

The corresponding veRL integration PR (a 1-line EngineRegistry registration) is staged separately.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces support for XPU devices across various modules, including device detection, communication backends, and synchronization utilities. However, the new guard in get_device_key incorrectly returns "unknown" for XPU devices, bypassing device-specific configurations, and the conditional import of group_gemm kernels in moe_layer.py is likely to cause a NameError on XPU systems when those kernels are referenced at runtime.

Comment on lines +24 to +25:

```python
if not torch.cuda.is_available():
    return "unknown"
```

Severity: high

This guard returns "unknown" for XPU devices, which prevents the function from reaching the get_device_name() call at the end of the function. This will cause MoE kernels to use fallback configurations instead of device-specific ones. Since this PR aims to add XPU support, the logic should be updated to allow XPU devices to fall through to the device name detection while still guarding the CUDA-specific get_device_capability() calls to avoid crashes.
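A minimal sketch of the fix being described, assuming the reviewer's intent; get_device_name() is stubbed here because its real body is not shown in this PR, and the capability cutoff is purely illustrative:

```python
# Hedged sketch of the suggested get_device_key fix: guard only the
# CUDA-specific capability call so XPU still reaches name-based lookup.
import torch

def get_device_name() -> str:
    # Stub; the real helper is defined elsewhere in VeOmni.
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.get_device_name(0)
    return "unknown"

def get_device_key() -> str:
    if torch.cuda.is_available():
        # CUDA-only API; calling it unguarded crashes on XPU.
        major, _minor = torch.cuda.get_device_capability()
        if major < 7:  # illustrative CUDA-specific gate, not VeOmni's rule
            return "unknown"
    elif not (hasattr(torch, "xpu") and torch.xpu.is_available()):
        return "unknown"
    # CUDA and XPU both fall through to device-name detection.
    return get_device_name()
```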



Suggested change:

```diff
-if not is_torch_npu_available():
+if not is_torch_npu_available() and torch.cuda.is_available():
```

Severity: high

Guarding the import of group_gemm kernels with torch.cuda.is_available() will cause a NameError at runtime on XPU devices when EPGroupGemm or EPMergedFc1GroupGemm are used, as these classes refer to the imported functions in their methods. If MoE is not yet supported on XPU, it would be better to provide a clear error message (e.g., a RuntimeError in the forward method) or ensure the functions are defined as None and checked before use, rather than allowing a NameError to occur.
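A sketch of the pattern the review suggests (and which a later commit in this PR adopts): None defaults for the guarded names plus an actionable RuntimeError. The import path and helper function are hypothetical, not VeOmni's exact code:

```python
# Illustrative guarded-import pattern; the real import path and call sites
# in moe_layer.py differ. group_gemm kernels are Triton/CUDA-only.
import torch
from transformers.utils import is_torch_npu_available

ep_group_gemm = None  # stays None on non-CUDA devices instead of being undefined
if not is_torch_npu_available() and torch.cuda.is_available():
    from veomni.ops.kernels.moe import ep_group_gemm  # hypothetical path

def require_group_gemm():
    """Fail with guidance instead of a NameError when the kernels are absent."""
    if ep_group_gemm is None:
        raise RuntimeError(
            "Fused MoE group_gemm kernels are CUDA-only and unavailable on this "
            "device (e.g. XPU); set moe_implementation='eager' instead."
        )
    return ep_group_gemm
```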

- Add IS_XPU_AVAILABLE flag, XPU branch in get_device_type(),
  get_dist_comm_backend() (xccl), and stream_synchronize()
- Guard torch.cuda.get_device_capability() for non-CUDA devices;
  let XPU fall through to get_device_name() instead of returning 'unknown'
- Guard group_gemm import (CUDA-only Triton kernels) with None defaults
  to avoid NameError on non-CUDA devices
- Accept 'xpu' as valid init_device in non-FSDP path

Tested on Intel Arc Pro B60 (Battlemage BMG-G21, 24 GB VRAM):
- 1-GPU standalone: 7/7 pass
- 2-GPU FSDP2: 8/8 pass (with CCL_ATL_SHM=1)
- veRL e2e GRPO (VeOmni engine + vLLM rollout): PASS
- Model: Qwen2.5-0.5B-Instruct (494M params, bf16)
@kahlun force-pushed the xpu/device-support branch from 36b72a1 to b212058 on April 28, 2026 at 08:28
- device.py: keep explicit get_device_key fallback for non-CUDA devices
- moe_layer.py: raise actionable RuntimeError when fused MoE group_gemm is unavailable (e.g., XPU), guiding users to moe_implementation=eager
- tests/special_xpu/test_fsdp2_simple_xpu.py: rewritten to true FSDP2 via fully_shard (no FSDPv1 API; see the sketch below)
- tests/special_xpu/test_fsdp2_simple_xpu.py: removed dead train_step helper
- tests/special_xpu/run_veomni_e2e_sft_xpu.sh: clarify this trainer path is fsdp1 smoke; FSDP2 coverage comes from test_fsdp2_simple_xpu.py

Validated:
- 2-GPU XPU FSDP2 smoke passes with fully_shard (loss decreases)
- FSDP2 wrappers present at runtime (FSDPModule count > 0)
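
For reference, a minimal sketch of the "true FSDP2 via fully_shard" pattern that commit describes, assuming PyTorch >= 2.6 (where fully_shard and FSDPModule live under torch.distributed.fsdp) and a process group already initialized with the xccl backend; the model and mesh details are assumptions, not the test file's contents:

```python
# Hedged sketch of an FSDP2 (fully_shard) setup on XPU. Assumes PyTorch >= 2.6
# and torch.distributed already initialized with the "xccl" backend.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import FSDPModule, fully_shard

def shard_model_fsdp2(model: torch.nn.Module) -> torch.nn.Module:
    mesh = init_device_mesh("xpu", (dist.get_world_size(),))
    for block in model.children():
        fully_shard(block, mesh=mesh)  # shard submodules first...
    fully_shard(model, mesh=mesh)      # ...then the root module
    # Mirrors the validation note above: FSDP2 wrappers present at runtime.
    assert sum(isinstance(m, FSDPModule) for m in model.modules()) > 0
    return model
```
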
@kahlun force-pushed the xpu/device-support branch from 2c68ee9 to f893cad on April 30, 2026 at 04:11
kahlun added a commit to kahlun/verl that referenced this pull request on Apr 30, 2026:
- Add 'xpu' to EngineRegistry.register device list for VeOmniEngineWithLMHead (illustrated below)
- Add GRPO VeOmni XPU e2e test script (tests/special_xpu/run_grpo_veomni_xpu.sh)

Depends on upstream VeOmni XPU patches: ByteDance-Seed/VeOmni#695
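
To illustrate the shape of that one-line registration change, here is a toy device-gated registry; the real EngineRegistry in veRL differs, and only the idea of adding "xpu" to the decorator's device list comes from the commit message:

```python
# Toy reconstruction of a device-gated engine registry. Only the addition of
# "xpu" to the device list is from the commit message; everything else is
# illustrative scaffolding, not veRL's actual EngineRegistry.
from typing import Callable, Dict, List, Type

class EngineRegistry:
    _engines: Dict[str, Type] = {}

    @classmethod
    def register(cls, device: List[str]) -> Callable[[Type], Type]:
        def wrap(engine_cls: Type) -> Type:
            for dev in device:
                cls._engines[dev] = engine_cls  # engine selectable per device
            return engine_cls
        return wrap

@EngineRegistry.register(device=["cuda", "npu", "xpu"])  # "xpu" is the new entry
class VeOmniEngineWithLMHead:
    ...
```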