
[ray, single_controller] feat: add XPU resource mapping for Ray #14

Open
kahlun wants to merge 3 commits into xpu/pr-a0-device-detection from xpu/pr-a2-ray-integration

Conversation


kahlun (Owner) commented Apr 7, 2026

Summary

Teach Ray's resource system about Intel XPU devices.

  • Resource detection: check node_info.get("xpu", 0) alongside GPU/NPU
  • Placement group: {"xpu": num_gpus} for XPU worker scheduling
  • Worker local_rank: derived as RANK % LOCAL_WORLD_SIZE when Ray doesn't expose XPU as a native accelerator

Depends on: device detection PR (parallel with XCCL workarounds)
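
For concreteness, a minimal sketch of what the mapping could look like on the Ray side; the function names and the shape of `node_info` are illustrative assumptions, not the actual verl code:

```python
# Illustrative sketch only -- not the verl implementation.
from ray.util.placement_group import placement_group


def detect_xpu_count(node_info: dict) -> int:
    # Intel XPUs show up as a custom "xpu" resource rather than "GPU",
    # so check it alongside the usual GPU/NPU keys.
    return int(node_info.get("xpu", 0))


def make_xpu_placement_group(num_workers: int, num_gpus: int):
    # Schedule XPU workers through the custom {"xpu": num_gpus} resource
    # instead of the built-in {"GPU": num_gpus}.
    bundles = [{"CPU": 1, "xpu": num_gpus} for _ in range(num_workers)]
    return placement_group(bundles, strategy="PACK")
```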

Test plan

  • Existing Ray tests unaffected (XPU path only activates on XPU hardware)

kahlun added 2 commits April 7, 2026 03:56
- Add is_torch_xpu_available() and is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU
  (flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
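
A rough sketch of the helpers listed above, assuming the public torch.xpu API; the names follow the commit message but the bodies are illustrative:

```python
# Illustrative sketch of the device-detection helpers; not the verl source.
import torch


def is_torch_xpu_available() -> bool:
    # torch.xpu only exists on builds with Intel XPU support.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


def get_device_name() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if is_torch_xpu_available():
        return "xpu"
    return "cpu"


def get_nccl_backend() -> str:
    # XCCL is the oneCCL-backed process-group backend for XPU.
    return "xccl" if get_device_name() == "xpu" else "nccl"


def get_torch_device():
    # Return the device namespace (torch.xpu or torch.cuda) for the active backend.
    return torch.xpu if get_device_name() == "xpu" else torch.cuda
```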
- Resource detection: check xpu custom resource in node info
- Placement group: {xpu: num_gpus} for XPU workers
- Worker local_rank from RANK % LOCAL_WORLD_SIZE for XPU
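
The local_rank fallback in the last item amounts to something like the following (a sketch, assuming torchrun-style RANK and LOCAL_WORLD_SIZE env vars are set):

```python
# Illustrative fallback for local_rank when Ray does not assign XPU device IDs.
import os


def get_local_rank_fallback() -> int:
    rank = int(os.environ["RANK"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    # Each node hosts LOCAL_WORLD_SIZE workers, so the position within the
    # node is the global rank modulo that count.
    return rank % local_world_size
```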
kahlun force-pushed the xpu/pr-a0-device-detection branch from 07d47d8 to f381a03 on April 7, 2026 at 11:08
kahlun force-pushed the xpu/pr-a0-device-detection branch from f381a03 to 99b68bf on April 7, 2026 at 11:09
kahlun force-pushed the xpu/pr-a2-ray-integration branch from 961f46d to 6bc7e34 on April 7, 2026 at 11:09
F.sdpa dispatches to Intel SYCL-TLA Flash kernel on XPU (10-22x faster).
Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
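
As a usage note, the claim is that plain PyTorch SDPA already dispatches to the fused kernel on XPU, so no flash_attn dependency is needed; a minimal, hedged example (shapes and dtype chosen arbitrarily):

```python
# Minimal SDPA call on XPU; kernel selection is handled inside PyTorch.
import torch
import torch.nn.functional as F

if hasattr(torch, "xpu") and torch.xpu.is_available():
    q, k, v = (torch.randn(1, 8, 1024, 64, device="xpu", dtype=torch.bfloat16)
               for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```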
kahlun force-pushed the xpu/pr-a0-device-detection branch 6 times, most recently from 0097030 to fa549d1 on April 10, 2026 at 04:41
kahlun added a commit that referenced this pull request Apr 16, 2026
Hard blocks fixed:
- #1: Replace shell heredoc source-patching with verl/utils/vllm/xpu_patches.py
  that does import-level monkey-patching; no source modification, version-stable
- #2: Remove hardcoded ONEAPI_DEVICE_SELECTOR='level_zero:0,1' from module-level
  PPO_RAY_RUNTIME_ENV dict; gate propagation to XPU hosts only in get_ppo_ray_runtime_env()
- #3: Gate torch.xpu.synchronize() behind VERL_XPU_SYNC_MICROBATCH env var;
  document the actual root cause (oneCCL non-re-entrancy during FSDP collectives)
- #4: Document torch-xpu-ops#3020 in all_gather workaround; add warning at world_size>8

Strong objections fixed:
- #6: Add logger.warning in is_support_ipc() when XPU falls back to shared memory
- #7: Fix Dockerfile sitecustomize path — use site.getsitepackages() not hardcoded
  /usr/local/lib/python3.12/ so it works regardless of Python prefix
- #8: Add hasattr guard for set_force_sum_reduction_for_comms with fallback warning
- #13 (fix #2 side-effect): Restore blank lines in constants_ppo.py

Moderate concerns fixed:
- #11: Improve list() comment — explain oneCCL non-re-entrancy as actual root cause
- #12: Remove numpy<2.0.0 pin (dpctl 0.21.1 does not require it)
- #14: Change 'from None' to 'from e' in create_engine_config exception chain
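
Two of these fixes translate naturally into short sketches; `set_force_sum_reduction_for_comms` and `VERL_XPU_SYNC_MICROBATCH` come from the commit text, while the surrounding code is illustrative:

```python
# Illustrative sketches of fixes #3 and #8; not the actual diff.
import logging
import os
import torch

logger = logging.getLogger(__name__)


def maybe_sync_microbatch():
    # Fix #3: only synchronize between micro-batches when explicitly requested,
    # since the blanket sync papered over oneCCL non-re-entrancy rather than
    # being needed on every step.
    if os.environ.get("VERL_XPU_SYNC_MICROBATCH", "0") == "1":
        torch.xpu.synchronize()


def force_sum_reduction(comm_hooks):
    # Fix #8: older builds may not expose this hook, so guard with hasattr
    # and warn instead of crashing. `comm_hooks` is a hypothetical handle.
    if hasattr(comm_hooks, "set_force_sum_reduction_for_comms"):
        comm_hooks.set_force_sum_reduction_for_comms(True)
    else:
        logger.warning("set_force_sum_reduction_for_comms not available; "
                       "falling back to default reduction behavior")
```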