[hardware] feat: add Intel XPU device support #695
kahlun wants to merge 4 commits into ByteDance-Seed:main
Conversation
Code Review
This pull request introduces support for XPU devices across various modules, including device detection, communication backends, and synchronization utilities. Feedback indicates that the new guard in get_device_key incorrectly returns "unknown" for XPU devices, bypassing device-specific configurations. Furthermore, the conditional import of group_gemm kernels in moe_layer.py is likely to cause a NameError on XPU systems when those kernels are referenced at runtime.
```python
if not torch.cuda.is_available():
    return "unknown"
```
This guard returns "unknown" for XPU devices, which prevents the function from reaching the get_device_name() call at the end of the function. This will cause MoE kernels to use fallback configurations instead of device-specific ones. Since this PR aims to add XPU support, the logic should be updated to allow XPU devices to fall through to the device name detection while still guarding the CUDA-specific get_device_capability() calls to avoid crashes.
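The suggested fix can be sketched as follows. This is a hypothetical reconstruction, not the repository's actual code: the key format (`sm_{major}{minor}`) and the exact fall-through logic are assumptions; only the shape of the guard comes from the review comment.

```python
import torch


def get_device_key() -> str:
    """Return the key used to select device-specific MoE kernel configs.

    Sketch of the suggested fix: guard only the CUDA-specific
    get_device_capability() call, and let XPU fall through to a
    device-name lookup instead of returning "unknown".
    """
    if torch.cuda.is_available():
        # CUDA-only API; calling it without a CUDA device would raise.
        major, minor = torch.cuda.get_device_capability()
        return f"sm_{major}{minor}"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        # XPU falls through to name-based detection so that tuned,
        # device-specific configurations can still be matched.
        return torch.xpu.get_device_name()
    return "unknown"
```

On a CPU-only host this still returns "unknown", preserving the existing fallback behavior.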
```diff
- if not is_torch_npu_available():
+ if not is_torch_npu_available() and torch.cuda.is_available():
```
Guarding the import of group_gemm kernels with torch.cuda.is_available() will cause a NameError at runtime on XPU devices when EPGroupGemm or EPMergedFc1GroupGemm are used, as these classes refer to the imported functions in their methods. If MoE is not yet supported on XPU, it would be better to provide a clear error message (e.g., a RuntimeError in the forward method) or ensure the functions are defined as None and checked before use, rather than allowing a NameError to occur.
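The reviewer's suggestion can be sketched like this. The import path `veomni.ops.group_gemm` and the symbol name `ep_group_gemm` are placeholders, not the repository's real names; the pattern (None default plus an actionable RuntimeError) is what the comment describes.

```python
import torch

# Bind the kernel symbol to None when the CUDA-only Triton kernels
# cannot be imported, so a lookup never raises a bare NameError.
ep_group_gemm = None
if torch.cuda.is_available():
    try:
        # Hypothetical import path for illustration only.
        from veomni.ops.group_gemm import ep_group_gemm
    except ImportError:
        ep_group_gemm = None


def require_group_gemm():
    """Return the fused kernel or fail with a clear, actionable error."""
    if ep_group_gemm is None:
        raise RuntimeError(
            "Fused MoE group_gemm kernels are CUDA-only and unavailable "
            "on this device; use moe_implementation='eager' instead."
        )
    return ep_group_gemm
```

Callers such as a `forward` method would invoke `require_group_gemm()` once up front, turning a confusing NameError into a message that points at the supported configuration.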
- Add IS_XPU_AVAILABLE flag, XPU branch in get_device_type(), get_dist_comm_backend() (xccl), and stream_synchronize()
- Guard torch.cuda.get_device_capability() for non-CUDA devices; let XPU fall through to get_device_name() instead of returning 'unknown'
- Guard group_gemm import (CUDA-only Triton kernels) with None defaults to avoid NameError on non-CUDA devices
- Accept 'xpu' as valid init_device in non-FSDP path

Tested on Intel Arc Pro B60 (Battlemage BMG-G21, 24 GB VRAM):
- 1-GPU standalone: 7/7 pass
- 2-GPU FSDP2: 8/8 pass (with CCL_ATL_SHM=1)
- veRL e2e GRPO (VeOmni engine + vLLM rollout): PASS
- Model: Qwen2.5-0.5B-Instruct (494M params, bf16)
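The device.py additions listed in this commit can be sketched as below. The function names follow the commit message; the bodies are assumptions about what such helpers typically look like, not the PR's actual diff.

```python
import torch

# Module-level flag, computed once at import time.
IS_XPU_AVAILABLE = hasattr(torch, "xpu") and torch.xpu.is_available()


def get_device_type() -> str:
    """Detect the accelerator type, preferring CUDA, then XPU, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if IS_XPU_AVAILABLE:
        return "xpu"
    return "cpu"


def get_dist_comm_backend() -> str:
    """Pick the process-group backend; xccl is the oneCCL-backed
    backend for Intel XPU in recent PyTorch releases."""
    return {"cuda": "nccl", "xpu": "xccl"}.get(get_device_type(), "gloo")


def stream_synchronize() -> None:
    """Block until all queued work on the current device stream finishes."""
    device = get_device_type()
    if device == "cuda":
        torch.cuda.synchronize()
    elif device == "xpu":
        torch.xpu.synchronize()
```

On a CPU-only host `stream_synchronize()` is a no-op and the backend falls back to gloo, which keeps the helpers safe to call unconditionally.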
Force-pushed 36b72a1 to b212058
- device.py: keep explicit get_device_key fallback for non-CUDA devices
- moe_layer.py: raise actionable RuntimeError when fused MoE group_gemm is unavailable (e.g., XPU), guiding users to moe_implementation=eager
- tests/special_xpu/test_fsdp2_simple_xpu.py: rewritten to true FSDP2 via fully_shard (no FSDPv1 API)
- tests/special_xpu/test_fsdp2_simple_xpu.py: removed dead train_step helper
- tests/special_xpu/run_veomni_e2e_sft_xpu.sh: clarify this trainer path is fsdp1 smoke; FSDP2 coverage comes from test_fsdp2_simple_xpu.py

Validated:
- 2-GPU XPU FSDP2 smoke passes with fully_shard (loss decreases)
- FSDP2 wrappers present at runtime (FSDPModule count > 0)
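The "FSDPModule count > 0" runtime check mentioned in this commit can be sketched with a small helper. The helper name is hypothetical; the `FSDPModule` class is the FSDP2 wrapper type that `fully_shard` attaches to modules in recent PyTorch.

```python
import torch.nn as nn


def count_fsdp2_wrappers(model: nn.Module) -> int:
    """Count submodules wrapped by FSDP2 (fully_shard).

    Returns 0 both for unsharded models and on torch builds that
    do not expose the FSDP2 API under torch.distributed.fsdp.
    """
    try:
        from torch.distributed.fsdp import FSDPModule
    except ImportError:
        return 0
    return sum(1 for m in model.modules() if isinstance(m, FSDPModule))
```

An unsharded model yields 0; after `fully_shard` has been applied under an initialized process group, a positive count confirms the FSDP2 wrappers are actually present, which is what the smoke test asserts.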
Force-pushed 2c68ee9 to f893cad
- Add 'xpu' to EngineRegistry.register device list for VeOmniEngineWithLMHead
- Add GRPO VeOmni XPU e2e test script (tests/special_xpu/run_grpo_veomni_xpu.sh)

Depends on upstream VeOmni XPU patches: ByteDance-Seed/VeOmni#695
Summary
Add Intel XPU (GPU) device detection and backend support to VeOmni, enabling training on Intel Arc/Data Center GPUs.
Changes (4 files, +13 / -3 lines)
- veomni/utils/device.py: add IS_XPU_AVAILABLE flag, XPU branch in get_device_type(), get_dist_comm_backend() (xccl), and stream_synchronize()
- veomni/ops/kernels/moe/_kernels/utils/device.py: guard torch.cuda.get_device_capability() for non-CUDA devices
- veomni/distributed/moe/moe_layer.py: guard group_gemm import (Triton CUDA kernels) for CUDA-only
- veomni/distributed/torch_parallelize.py: accept "xpu" as valid init_device in non-FSDP path

Testing
Tested on Intel Arc Pro B60 (Battlemage BMG-G21, 24 GB VRAM):
- 1-GPU standalone: 7/7 pass
- 2-GPU FSDP2: 8/8 pass (with CCL_ATL_SHM=1)
- veRL e2e GRPO (VeOmni engine + vLLM rollout): PASS
- Model: Qwen2.5-0.5B-Instruct (494M params, bf16)

Context
Follow-up to #648 (src_data_rank=None fix for scatter hang). These are the remaining XPU device detection gaps that block VeOmni from running on Intel GPUs. The corresponding veRL integration PR (1-line EngineRegistry registration) is staged separately.
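The torch_parallelize.py change (accepting "xpu" as a valid init_device in the non-FSDP path) amounts to extending an allow-list. A minimal sketch, in which the tuple contents and helper name are assumptions rather than the repository's actual code:

```python
# Hypothetical allow-list; the real code may inline this check.
VALID_INIT_DEVICES = ("cpu", "cuda", "meta", "xpu")


def check_init_device(init_device: str) -> str:
    """Validate init_device early, before any model materialization,
    so an unsupported value fails fast with a clear message."""
    if init_device not in VALID_INIT_DEVICES:
        raise ValueError(
            f"Unsupported init_device {init_device!r}; "
            f"expected one of {VALID_INIT_DEVICES}"
        )
    return init_device
```

With "xpu" in the list, `check_init_device("xpu")` passes through unchanged instead of rejecting Intel GPU initialization.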