
[1/7][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers#5

Open
kahlun wants to merge 7 commits into xpu/pr-a1-xccl-workarounds from xpu/pr-b-fsdp-workers

Conversation


@kahlun kahlun commented Apr 7, 2026

Summary

  • Register XPU in FSDP EngineRegistry (device=["cuda", "npu", "xpu"])
  • Add XCCL ReduceOp.AVG workaround (SUM + divide) in engine_workers.py and sft_trainer.py (see the sketch after this list)
  • Add XCCL ReduceOp.MAX CPU-routing workaround in seqlen_balancing.py
  • Force eager attention on XPU in fsdp_workers.py (auto-disable torch.compile on multi-GPU)
  • Add torch.xpu.manual_seed() in engine/utils.py
  • Set the XPU default attention implementation in config/model.py
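
A minimal sketch of the two collective workarounds above, assuming an initialized torch.distributed process group and a floating-point tensor; `all_reduce_avg` matches the helper named in the commit messages, while `all_reduce_max_via_cpu` is an illustrative name, not necessarily the one used in this PR:

```python
import torch
import torch.distributed as dist

def all_reduce_avg(tensor: torch.Tensor) -> torch.Tensor:
    # XCCL has no ReduceOp.AVG: emulate it with SUM, then divide in place.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor.div_(dist.get_world_size())
    return tensor

def all_reduce_max_via_cpu(tensor: torch.Tensor) -> torch.Tensor:
    # Per this PR's workaround, MAX is routed through CPU: with a composite
    # "cpu:gloo,xpu:xccl" group, a CPU tensor dispatches the collective to gloo.
    cpu_tensor = tensor.cpu()
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.MAX)
    return cpu_tensor.to(tensor.device)
```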

Depends on: PR A (core device abstraction, xpu/1-core-device-abstraction)

Includes PR A foundation (device.py, distributed.py, fsdp_utils.py, ray/base.py, worker.py) for self-contained review.

Hardware tested

  • 4× Intel Arc Pro B60 (Battlemage, 24 GB VRAM, PCIe)
  • PyTorch 2.10.0+xpu (container), 2.11.0+xpu (host)

Test results

  • T1.1: 1-GPU GRPO LoRA — 16 steps, 125 tok/s
  • T1.2: 2-GPU GRPO — 16 steps
  • T1.4: 4-GPU GRPO — 2 steps, 41s/step
  • T2.1-T2.5: SFT variants (1/4 GPU, Full/LoRA/VLM) — all PASS
  • All 14 RL algorithms validated (GRPO, PPO, RLOO, REINFORCE++, DPPO, ReMax, etc.)

Test plan

  • Existing CUDA unit tests still pass (no functional change to CUDA/NPU paths)
  • XPU FSDP SFT 1-GPU: loss decreasing, checkpoint saved
  • XPU FSDP GRPO 1-GPU: 16+ steps, valid metrics

@kahlun kahlun changed the title [XPU 2/6] Add FSDP engine and worker support for Intel XPU → [1/6][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers on Apr 7, 2026
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 73fa330 to 1c56742 on April 7, 2026 at 07:51
@kahlun kahlun changed the base branch from main to xpu/1-core-device-abstraction on April 7, 2026 at 07:51
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1c56742 to 1ce3b64 on April 7, 2026 at 08:05
@kahlun kahlun changed the title [1/6][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers → [1/7][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers on Apr 7, 2026
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1ce3b64 to 0ed5d94 on April 7, 2026 at 09:14
kahlun added 5 commits April 7, 2026 03:54
- Add is_torch_xpu_available() and is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU
  (flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
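
A hedged sketch of how such getters might look; the names follow the commit message, but the bodies here are assumptions rather than the repo's actual device.py code:

```python
import torch

def is_torch_xpu_available() -> bool:
    return hasattr(torch, "xpu") and torch.xpu.is_available()

def get_device_name() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if is_torch_xpu_available():
        return "xpu"
    return "cpu"

def get_nccl_backend() -> str:
    # oneCCL's torch.distributed backend for XPU is registered as "xccl".
    return "xccl" if get_device_name() == "xpu" else "nccl"

def get_torch_device():
    # Return the torch.xpu namespace on XPU, torch.cuda otherwise.
    return torch.xpu if get_device_name() == "xpu" else torch.cuda
```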
- Composite distributed backend "cpu:gloo,xpu:xccl" (both CPU and XPU
  tensors need to be supported in the same process group)
- all_reduce_avg() workaround: XCCL lacks ReduceOp.AVG, use SUM + divide
- FSDP2 set_force_sum_reduction_for_comms(True) for reduce_scatter

These workarounds are temporary — they will be removed when oneCCL adds
native AVG support.
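
A sketch of the composite initialization described above; PyTorch accepts a comma-separated device:backend list, so a single default group can serve both CPU and XPU tensors (rendezvous env vars are assumed to be set by the launcher, and the build must include XCCL support):

```python
import torch.distributed as dist

# One group, two backends: collectives on CPU tensors go through gloo,
# collectives on XPU tensors go through xccl.
dist.init_process_group(backend="cpu:gloo,xpu:xccl")
```

The FSDP2 set_force_sum_reduction_for_comms(True) call mentioned above then keeps reduce_scatter on SUM semantics, with the averaging division applied afterwards.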
- Ray resource detection: check for "xpu" custom resource in node info
- Placement group: request {"xpu": num_gpus} for XPU workers
- Worker local_rank: derive from RANK % LOCAL_WORLD_SIZE when Ray doesn't
  recognize XPU as a native accelerator type
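
A sketch of that local_rank fallback, assuming torchrun-style RANK and LOCAL_WORLD_SIZE environment variables are present:

```python
import os

# When Ray doesn't know XPU as a native accelerator type, there is no
# accelerator-specific local rank, so derive it from the global rank.
rank = int(os.environ["RANK"])
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
local_rank = rank % local_world_size
```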
…and workers

- Register XPU in FSDP EngineRegistry
- XCCL ReduceOp.AVG/MAX workarounds in engine_workers + sft_trainer + seqlen_balancing
- Force eager attention, auto-disable torch.compile on XPU multi-GPU
- torch.xpu.manual_seed() in engine/utils.py
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 0ed5d94 to 2823ab0 on April 7, 2026 at 11:07
@kahlun kahlun changed the base branch from xpu/1-core-device-abstraction to xpu/pr-a1-xccl-workarounds on April 7, 2026 at 11:08
@kahlun kahlun force-pushed the xpu/pr-a1-xccl-workarounds branch from 2e3513a to 201cd50 on April 7, 2026 at 11:09
kahlun added 2 commits April 8, 2026 23:38
The force-override was a safety measure while SDPA was unvalidated.
SDPA is now benchmarked and confirmed 10-22x faster than eager on XPU
(Arc Pro B60, PyTorch 2.11, bf16). The NaN issue only affects explicit
bool masks with all-False rows (left-padded generation), not training
with is_causal=True.

device.py now returns "sdpa" for XPU via get_default_attention_implementation().
Removing the override lets that take effect for both actor and critic.
F.sdpa dispatches to Intel SYCL-TLA Flash kernel on XPU (10-22x faster).
Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
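
A small repro sketch of the distinction the commit draws, with illustrative shapes (falls back to CPU when no XPU is present): is_causal=True never produces a fully masked row, whereas an explicit bool mask with an all-False row makes the softmax see only -inf and return NaN.

```python
import torch
import torch.nn.functional as F

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
q = torch.randn(1, 2, 4, 8, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)

# Training-style call: causal masking via the flag, no NaN risk.
safe = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Left-padded generation style: a bool mask whose first row is all False.
mask = torch.ones(1, 1, 4, 4, dtype=torch.bool, device=device)
mask[..., 0, :] = False
risky = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(safe.isnan().any().item())   # False
print(risky.isnan().any().item())  # True: softmax over an all-masked row
```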
