
[hardware] feat: add XCCL backend workarounds for Intel XPU #13

Open

kahlun wants to merge 3 commits into xpu/pr-a0-device-detection from xpu/pr-a1-xccl-workarounds

Conversation

@kahlun (Owner) commented Apr 7, 2026

Summary

Temporary XCCL workarounds — will be removed when oneCCL adds native AVG support.

  • Composite backend cpu:gloo,xpu:xccl for mixed CPU+XPU tensors
  • all_reduce_avg(): XCCL lacks ReduceOp.AVG → SUM + manual division (sketched below)
  • FSDP2 set_force_sum_reduction_for_comms(True) for reduce_scatter
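
A minimal sketch of the AVG workaround, assuming a torch.distributed process group is already initialized; the helper name follows the summary bullet above, but the exact signature in this PR may differ:

```python
import torch
import torch.distributed as dist

def all_reduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Emulate AVG as SUM + division for backends without ReduceOp.AVG (e.g. XCCL)."""
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    tensor.div_(dist.get_world_size(group=group))
    return tensor
```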

Depends on: device detection PR

Test plan

  • No change when XPU not available (workarounds gated by is_xpu_available)

kahlun added 2 commits April 7, 2026 03:56
- Add is_torch_xpu_available() and is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU
  (flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
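
A minimal sketch of the shared detection gate named in this commit message; the function name matches the commit, the body is illustrative:

```python
import torch

def is_torch_xpu_available() -> bool:
    # torch.xpu is only present in PyTorch builds with Intel XPU support,
    # so guard the attribute before querying the runtime.
    return hasattr(torch, "xpu") and torch.xpu.is_available()

is_xpu_available = is_torch_xpu_available()
```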
- Composite backend cpu:gloo,xpu:xccl for mixed tensor support
- all_reduce_avg(): XCCL lacks ReduceOp.AVG, use SUM + divide
- FSDP2 set_force_sum_reduction_for_comms(True) for reduce_scatter
- Temporary workarounds — removed when oneCCL adds native AVG
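
A hedged sketch of how the composite backend string is typically passed to process-group init (PyTorch accepts `device:backend` pairs here); the surrounding wiring in this PR may differ:

```python
import torch
import torch.distributed as dist

# Route CPU tensors through gloo and XPU tensors through XCCL in one group.
# Assumes MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE are set (env:// init).
use_xpu = hasattr(torch, "xpu") and torch.xpu.is_available()
dist.init_process_group(backend="cpu:gloo,xpu:xccl" if use_xpu else "nccl")
```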
@kahlun force-pushed the xpu/pr-a0-device-detection branch from 07d47d8 to f381a03 (April 7, 2026 11:08)
@kahlun force-pushed the xpu/pr-a0-device-detection branch from f381a03 to 99b68bf (April 7, 2026 11:09)
@kahlun force-pushed the xpu/pr-a1-xccl-workarounds branch from 2e3513a to 201cd50 (April 7, 2026 11:09)
F.sdpa dispatches to Intel SYCL-TLA Flash kernel on XPU (10-22x faster).
Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
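
For reference, a small usage sketch of the SDPA path this commit describes (F.scaled_dot_product_attention); the shapes and the "xpu" device string are illustrative:

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64, device="xpu", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)
# On XPU, SDPA dispatches to a fused flash-attention kernel under the hood,
# so no CUDA-only flash_attn package is needed.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```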
@kahlun force-pushed the xpu/pr-a0-device-detection branch 6 times, most recently from 0097030 to fa549d1 (April 10, 2026 04:41)
kahlun added a commit that referenced this pull request Apr 16, 2026
Hard blocks fixed:
- #1: Replace shell heredoc source-patching with verl/utils/vllm/xpu_patches.py
  that does import-level monkey-patching; no source modification, version-stable
- #2: Remove hardcoded ONEAPI_DEVICE_SELECTOR='level_zero:0,1' from module-level
  PPO_RAY_RUNTIME_ENV dict; gate propagation to XPU hosts only in get_ppo_ray_runtime_env()
- #3: Gate torch.xpu.synchronize() behind VERL_XPU_SYNC_MICROBATCH env var;
  document the actual root cause (oneCCL non-re-entrancy during FSDP collectives)
- #4: Document torch-xpu-ops#3020 in all_gather workaround; add warning at world_size>8

Strong objections fixed:
- #6: Add logger.warning in is_support_ipc() when XPU falls back to shared memory
- #7: Fix Dockerfile sitecustomize path — use site.getsitepackages() not hardcoded
  /usr/local/lib/python3.12/ so it works regardless of Python prefix
- #8: Add hasattr guard for set_force_sum_reduction_for_comms with fallback warning
- #13 (fix #2 side-effect): Restore blank lines in constants_ppo.py

Moderate concerns fixed:
- #11: Improve list() comment — explain oneCCL non-re-entrancy as actual root cause
- #12: Remove numpy<2.0.0 pin (dpctl 0.21.1 does not require it)
- #14: Change 'from None' to 'from e' in create_engine_config exception chain
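
A sketch of the #8 guard, assuming `fsdp_module` is an FSDP2-wrapped module; the helper name and warning text are illustrative, not the exact code in this commit:

```python
import logging

logger = logging.getLogger(__name__)

def force_sum_reduction(fsdp_module) -> None:
    # Older torch builds may not expose this FSDP2 method; guard and warn
    # instead of crashing, per fix #8.
    if hasattr(fsdp_module, "set_force_sum_reduction_for_comms"):
        fsdp_module.set_force_sum_reduction_for_comms(True)
    else:
        logger.warning(
            "FSDP2 has no set_force_sum_reduction_for_comms; "
            "reduce_scatter on XCCL may not emulate AVG correctly"
        )
```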