
[ray, single_controller] feat: add XPU resource mapping for Ray #14

Open
kahlun wants to merge 3 commits into xpu/pr-a0-device-detection from xpu/pr-a2-ray-integration

Conversation


kahlun (Owner) commented Apr 7, 2026

Summary

Teach Ray's resource system about Intel XPU devices.

  • Resource detection: check node_info.get("xpu", 0) alongside GPU/NPU
  • Placement group: {"xpu": num_gpus} for XPU worker scheduling
  • Worker local_rank: derived as RANK % LOCAL_WORLD_SIZE when Ray doesn't expose XPU as a native accelerator

Depends on: device detection PR (parallel with XCCL workarounds)
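
For concreteness, a minimal sketch of what the mapping could look like on the Ray side; the function names and the shape of `node_info` are illustrative assumptions, not the actual verl code:

```python
# Illustrative sketch only -- not the verl implementation.
from ray.util.placement_group import placement_group


def detect_xpu_count(node_info: dict) -> int:
    # Intel XPUs show up as a custom "xpu" resource rather than "GPU",
    # so check it alongside the usual GPU/NPU keys.
    return int(node_info.get("xpu", 0))


def make_xpu_placement_group(num_workers: int, num_gpus: int):
    # Schedule XPU workers through the custom {"xpu": num_gpus} resource
    # instead of the built-in {"GPU": num_gpus}.
    bundles = [{"CPU": 1, "xpu": num_gpus} for _ in range(num_workers)]
    return placement_group(bundles, strategy="PACK")
```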

Test plan

  • Existing Ray tests unaffected (XPU path only activates on XPU hardware)

kahlun added 2 commits April 7, 2026 03:56
- Add is_torch_xpu_available() and is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU
  (flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
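
A rough sketch of the helpers listed above, assuming the public torch.xpu API; the names follow the commit message but the bodies are illustrative:

```python
# Illustrative sketch of the device-detection helpers; not the verl source.
import torch


def is_torch_xpu_available() -> bool:
    # torch.xpu only exists on builds with Intel XPU support.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


def get_device_name() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if is_torch_xpu_available():
        return "xpu"
    return "cpu"


def get_nccl_backend() -> str:
    # XCCL is the oneCCL-backed process-group backend for XPU.
    return "xccl" if get_device_name() == "xpu" else "nccl"


def get_torch_device():
    # Return the device namespace (torch.xpu or torch.cuda) for the active backend.
    return torch.xpu if get_device_name() == "xpu" else torch.cuda
```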
- Resource detection: check xpu custom resource in node info
- Placement group: {xpu: num_gpus} for XPU workers
- Worker local_rank from RANK % LOCAL_WORLD_SIZE for XPU
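
The local_rank fallback in the last item amounts to something like the following (a sketch, assuming torchrun-style RANK and LOCAL_WORLD_SIZE env vars are set):

```python
# Illustrative fallback for local_rank when Ray does not assign XPU device IDs.
import os


def get_local_rank_fallback() -> int:
    rank = int(os.environ["RANK"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    # Each node hosts LOCAL_WORLD_SIZE workers, so the position within the
    # node is the global rank modulo that count.
    return rank % local_world_size
```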
kahlun force-pushed the xpu/pr-a0-device-detection branch from 07d47d8 to f381a03 on April 7, 2026 at 11:08
kahlun force-pushed the xpu/pr-a0-device-detection branch from f381a03 to 99b68bf on April 7, 2026 at 11:09
kahlun force-pushed the xpu/pr-a2-ray-integration branch from 961f46d to 6bc7e34 on April 7, 2026 at 11:09
F.sdpa dispatches to Intel SYCL-TLA Flash kernel on XPU (10-22x faster).
Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
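
As a usage note, the claim is that plain PyTorch SDPA already dispatches to the fused kernel on XPU, so no flash_attn dependency is needed; a minimal, hedged example (shapes and dtype chosen arbitrarily):

```python
# Minimal SDPA call on XPU; kernel selection is handled inside PyTorch.
import torch
import torch.nn.functional as F

if hasattr(torch, "xpu") and torch.xpu.is_available():
    q, k, v = (torch.randn(1, 8, 1024, 64, device="xpu", dtype=torch.bfloat16)
               for _ in range(3))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```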
kahlun force-pushed the xpu/pr-a0-device-detection branch 6 times, most recently from 0097030 to fa549d1 on April 10, 2026 at 04:41
kahlun added a commit that referenced this pull request Apr 16, 2026
Hard blocks fixed:
- #1: Replace shell heredoc source-patching with verl/utils/vllm/xpu_patches.py
  that does import-level monkey-patching; no source modification, version-stable
- #2: Remove hardcoded ONEAPI_DEVICE_SELECTOR='level_zero:0,1' from module-level
  PPO_RAY_RUNTIME_ENV dict; gate propagation to XPU hosts only in get_ppo_ray_runtime_env()
- #3: Gate torch.xpu.synchronize() behind VERL_XPU_SYNC_MICROBATCH env var;
  document the actual root cause (oneCCL non-re-entrancy during FSDP collectives)
- #4: Document torch-xpu-ops#3020 in all_gather workaround; add warning at world_size>8

Strong objections fixed:
- #6: Add logger.warning in is_support_ipc() when XPU falls back to shared memory
- #7: Fix Dockerfile sitecustomize path — use site.getsitepackages() not hardcoded
  /usr/local/lib/python3.12/ so it works regardless of Python prefix
- #8: Add hasattr guard for set_force_sum_reduction_for_comms with fallback warning
- #13 (fix #2 side-effect): Restore blank lines in constants_ppo.py

Moderate concerns fixed:
- #11: Improve list() comment — explain oneCCL non-re-entrancy as actual root cause
- #12: Remove numpy<2.0.0 pin (dpctl 0.21.1 does not require it)
- #14: Change 'from None' to 'from e' in create_engine_config exception chain
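
Two of these fixes translate naturally into short sketches; `set_force_sum_reduction_for_comms` and `VERL_XPU_SYNC_MICROBATCH` come from the commit text, while the surrounding code is illustrative:

```python
# Illustrative sketches of fixes #3 and #8; not the actual diff.
import logging
import os
import torch

logger = logging.getLogger(__name__)


def maybe_sync_microbatch():
    # Fix #3: only synchronize between micro-batches when explicitly requested,
    # since the blanket sync papered over oneCCL non-re-entrancy rather than
    # being needed on every step.
    if os.environ.get("VERL_XPU_SYNC_MICROBATCH", "0") == "1":
        torch.xpu.synchronize()


def force_sum_reduction(comm_hooks):
    # Fix #8: older builds may not expose this hook, so guard with hasattr
    # and warn instead of crashing. `comm_hooks` is a hypothetical handle.
    if hasattr(comm_hooks, "set_force_sum_reduction_for_comms"):
        comm_hooks.set_force_sum_reduction_for_comms(True)
    else:
        logger.warning("set_force_sum_reduction_for_comms not available; "
                       "falling back to default reduction behavior")
```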