[hardware] feat: add XCCL backend workarounds for Intel XPU #13
Open
kahlun wants to merge 3 commits into xpu/pr-a0-device-detection
Conversation
- Add is_torch_xpu_available() and an is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU (the flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return the torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
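The dispatch helpers named above can be sketched as simple branches on the availability flag. This is a hedged, hypothetical reconstruction from the commit message only: the flag is passed explicitly here, whereas the PR keeps a module-level is_xpu_available derived from torch.xpu, and the non-XPU return values are illustrative defaults.

```python
# Hypothetical sketch of the device-dispatch helpers from the commit
# message. The availability flag is a plain parameter here; the real
# code would derive it from torch.xpu at import time.

def get_device_name(is_xpu_available: bool) -> str:
    """Accelerator name used for tensor placement."""
    return "xpu" if is_xpu_available else "cuda"

def get_nccl_backend(is_xpu_available: bool) -> str:
    """XCCL is the oneCCL-backed process-group backend for XPU."""
    return "xccl" if is_xpu_available else "nccl"

def get_default_attention_implementation(is_xpu_available: bool) -> str:
    """flash_attn is CUDA-only; XPU falls back to PyTorch SDPA ("eager")."""
    return "eager" if is_xpu_available else "flash_attention_2"

def is_support_ipc(is_xpu_available: bool) -> bool:
    """SYCL exposes no IPC handles yet, so XPU reports False."""
    return not is_xpu_available
```

Keeping every branch behind one flag is what makes the "no behavioral change for existing CUDA/NPU paths" claim easy to verify.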
- Composite backend cpu:gloo,xpu:xccl for mixed tensor support
- all_reduce_avg(): XCCL lacks ReduceOp.AVG, so use SUM + divide
- FSDP2 set_force_sum_reduction_for_comms(True) for reduce_scatter
- Temporary workarounds, to be removed when oneCCL adds native AVG
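The AVG workaround above reduces to a two-step identity: average = SUM all-reduce followed by a local divide by world_size. A minimal sketch, with torch.distributed replaced by a toy in-process "group" so it runs anywhere (assumption: the real helper wraps dist.all_reduce on XPU tensors):

```python
# XCCL has no ReduceOp.AVG, so all_reduce_avg is emulated as a SUM
# all-reduce plus a local divide. rank_values[i] is rank i's local
# tensor, modeled here as a plain list of floats.

def all_reduce_sum(rank_values: list[list[float]]) -> list[list[float]]:
    """Toy SUM all-reduce: every rank ends up with the elementwise sum."""
    summed = [sum(col) for col in zip(*rank_values)]
    return [summed[:] for _ in rank_values]

def all_reduce_avg(rank_values: list[list[float]]) -> list[list[float]]:
    """AVG emulated as SUM + divide-by-world_size (the XCCL workaround)."""
    world_size = len(rank_values)
    reduced = all_reduce_sum(rank_values)
    return [[x / world_size for x in vals] for vals in reduced]
```

With two ranks holding [1, 2] and [3, 4], every rank ends with [2, 3], exactly what a native ReduceOp.AVG would produce.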
F.scaled_dot_product_attention dispatches to Intel's SYCL-TLA Flash Attention kernel on XPU (10-22x faster). Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
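The kernel choice changes speed, not semantics: whichever backend SDPA dispatches to must compute softmax(QK^T/sqrt(d))V. As a torch-free correctness reference, here is that computation for a single query row in plain Python (a sketch of the math, not of the Intel kernel):

```python
import math

def sdpa_reference(q: list[float], k: list[list[float]],
                   v: list[list[float]]) -> list[float]:
    """Scaled dot-product attention for one query row:
    softmax(q . k_i / sqrt(d))-weighted sum of the rows of v.
    Numerically stabilized softmax (subtract the max score)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
              for key in k]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * row[j] for w, row in zip(weights, v))
            for j in range(len(v[0]))]
```

With equal scores the weights are uniform, so the output is the plain mean of the value rows; with one dominant score the output collapses to that row's values.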
kahlun added a commit that referenced this pull request on Apr 16, 2026
Hard blocks fixed:
- #1: Replace shell heredoc source-patching with verl/utils/vllm/xpu_patches.py, which does import-level monkey-patching; no source modification, version-stable
- #2: Remove hardcoded ONEAPI_DEVICE_SELECTOR='level_zero:0,1' from the module-level PPO_RAY_RUNTIME_ENV dict; gate propagation to XPU hosts only in get_ppo_ray_runtime_env()
- #3: Gate torch.xpu.synchronize() behind the VERL_XPU_SYNC_MICROBATCH env var; document the actual root cause (oneCCL non-re-entrancy during FSDP collectives)
- #4: Document torch-xpu-ops#3020 in the all_gather workaround; add a warning at world_size > 8

Strong objections fixed:
- #6: Add logger.warning in is_support_ipc() when XPU falls back to shared memory
- #7: Fix the Dockerfile sitecustomize path: use site.getsitepackages() instead of a hardcoded /usr/local/lib/python3.12/ so it works regardless of Python prefix
- #8: Add a hasattr guard for set_force_sum_reduction_for_comms with a fallback warning
- #13 (fix #2 side-effect): Restore blank lines in constants_ppo.py

Moderate concerns fixed:
- #11: Improve the list() comment; explain oneCCL non-re-entrancy as the actual root cause
- #12: Remove the numpy<2.0.0 pin (dpctl 0.21.1 does not require it)
- #14: Change 'from None' to 'from e' in the create_engine_config exception chain
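The fix for #3 is an opt-in pattern worth spelling out: the per-microbatch device sync only fires when the env var is set, so the default path pays no cost. A minimal sketch, with the synchronize callable injected because torch.xpu.synchronize needs real hardware (the env var name comes from the commit; the function name is hypothetical):

```python
import os

def maybe_sync_microbatch(sync_fn) -> bool:
    """Run the device sync only when VERL_XPU_SYNC_MICROBATCH is set to a
    truthy value. The sync works around oneCCL non-re-entrancy during
    FSDP collectives; sync_fn stands in for torch.xpu.synchronize."""
    if os.environ.get("VERL_XPU_SYNC_MICROBATCH", "0").lower() in ("1", "true"):
        sync_fn()
        return True
    return False
```

Gating behind an env var (rather than always syncing) documents the workaround's cost and makes it trivially removable once the oneCCL issue is resolved.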
Summary
Temporary XCCL workarounds; they will be removed when oneCCL adds native AVG support.
- cpu:gloo,xpu:xccl composite backend for mixed CPU+XPU tensors
- all_reduce_avg(): XCCL lacks ReduceOp.AVG → SUM + manual division
- set_force_sum_reduction_for_comms(True) for reduce_scatter

Depends on: device detection PR
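The set_force_sum_reduction_for_comms call is version-sensitive, which is why review fix #8 added a hasattr guard. A hedged sketch of that guard pattern, with a plain object standing in for the real FSDP2 handle (the method name is from this PR; the wrapper name is hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def force_sum_reduction_if_supported(fsdp_obj) -> bool:
    """Enable SUM-based reduce_scatter when the installed FSDP2 exposes
    the API; otherwise warn and continue. Without it, reduce_scatter may
    request AVG, which XCCL does not implement."""
    if hasattr(fsdp_obj, "set_force_sum_reduction_for_comms"):
        fsdp_obj.set_force_sum_reduction_for_comms(True)
        return True
    logger.warning(
        "FSDP2 lacks set_force_sum_reduction_for_comms; "
        "reduce_scatter may use AVG and fail on XCCL"
    )
    return False
```

The guard keeps the workaround from hard-crashing on older PyTorch builds while still surfacing the degraded behavior in the logs.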
Test plan
- is_xpu_available