
[1/7][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers#5

Open
kahlun wants to merge 7 commits into xpu/pr-a1-xccl-workarounds from xpu/pr-b-fsdp-workers

Conversation


@kahlun kahlun commented Apr 7, 2026

Summary

  • Register XPU in FSDP EngineRegistry (device=["cuda", "npu", "xpu"])
  • Add XCCL ReduceOp.AVG workaround (SUM + divide) in engine_workers.py and sft_trainer.py (see the sketch after this list)
  • Add XCCL ReduceOp.MAX CPU-routing workaround in seqlen_balancing.py
  • Force eager attention on XPU in fsdp_workers.py (auto-disable torch.compile on multi-GPU)
  • Add torch.xpu.manual_seed() in engine/utils.py
  • Set the XPU default attention implementation in config/model.py
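
A minimal sketch of the two collective workarounds above, assuming an initialized torch.distributed process group and a floating-point tensor; `all_reduce_avg` matches the helper named in the commit messages, while `all_reduce_max_via_cpu` is an illustrative name, not necessarily the one used in this PR:

```python
import torch
import torch.distributed as dist

def all_reduce_avg(tensor: torch.Tensor) -> torch.Tensor:
    # XCCL has no ReduceOp.AVG: emulate it with SUM, then divide in place.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    tensor.div_(dist.get_world_size())
    return tensor

def all_reduce_max_via_cpu(tensor: torch.Tensor) -> torch.Tensor:
    # Per this PR's workaround, MAX is routed through CPU: with a composite
    # "cpu:gloo,xpu:xccl" group, a CPU tensor dispatches the collective to gloo.
    cpu_tensor = tensor.cpu()
    dist.all_reduce(cpu_tensor, op=dist.ReduceOp.MAX)
    return cpu_tensor.to(tensor.device)
```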

Depends on: PR A (core device abstraction, xpu/1-core-device-abstraction)

Includes PR A foundation (device.py, distributed.py, fsdp_utils.py, ray/base.py, worker.py) for self-contained review.

Hardware tested

  • 4× Intel Arc Pro B60 (Battlemage, 24 GB VRAM, PCIe)
  • PyTorch 2.10.0+xpu (container), 2.11.0+xpu (host)

Test results

  • T1.1: 1-GPU GRPO LoRA — 16 steps, 125 tok/s
  • T1.2: 2-GPU GRPO — 16 steps
  • T1.4: 4-GPU GRPO — 2 steps, 41s/step
  • T2.1-T2.5: SFT variants (1/4 GPU, Full/LoRA/VLM) — all PASS
  • All 14 RL algorithms validated (GRPO, PPO, RLOO, REINFORCE++, DPPO, ReMax, etc.)

Test plan

  • Existing CUDA unit tests still pass (no functional change to CUDA/NPU paths)
  • XPU FSDP SFT 1-GPU: loss decreasing, checkpoint saved
  • XPU FSDP GRPO 1-GPU: 16+ steps, valid metrics

@kahlun kahlun changed the title [XPU 2/6] Add FSDP engine and worker support for Intel XPU → [1/6][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers on Apr 7, 2026
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 73fa330 to 1c56742 on April 7, 2026 at 07:51
@kahlun kahlun changed the base branch from main to xpu/1-core-device-abstraction on April 7, 2026 at 07:51
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1c56742 to 1ce3b64 on April 7, 2026 at 08:05
@kahlun kahlun changed the title [1/6][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers → [1/7][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers on Apr 7, 2026
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1ce3b64 to 0ed5d94 on April 7, 2026 at 09:14
kahlun added 5 commits April 7, 2026 03:54
- Add is_torch_xpu_available() and is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU
  (flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
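
A hedged sketch of how such getters might look; the names follow the commit message, but the bodies here are assumptions rather than the repo's actual device.py code:

```python
import torch

def is_torch_xpu_available() -> bool:
    return hasattr(torch, "xpu") and torch.xpu.is_available()

def get_device_name() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if is_torch_xpu_available():
        return "xpu"
    return "cpu"

def get_nccl_backend() -> str:
    # oneCCL's torch.distributed backend for XPU is registered as "xccl".
    return "xccl" if get_device_name() == "xpu" else "nccl"

def get_torch_device():
    # Return the torch.xpu namespace on XPU, torch.cuda otherwise.
    return torch.xpu if get_device_name() == "xpu" else torch.cuda
```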
- Composite distributed backend "cpu:gloo,xpu:xccl" (both CPU and XPU
  tensors need to be supported in the same process group)
- all_reduce_avg() workaround: XCCL lacks ReduceOp.AVG, use SUM + divide
- FSDP2 set_force_sum_reduction_for_comms(True) for reduce_scatter

These workarounds are temporary — they will be removed when oneCCL adds
native AVG support.
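
A sketch of the composite initialization described above; PyTorch accepts a comma-separated device:backend list, so a single default group can serve both CPU and XPU tensors (rendezvous env vars are assumed to be set by the launcher, and the build must include XCCL support):

```python
import torch.distributed as dist

# One group, two backends: collectives on CPU tensors go through gloo,
# collectives on XPU tensors go through xccl.
dist.init_process_group(backend="cpu:gloo,xpu:xccl")
```

The FSDP2 set_force_sum_reduction_for_comms(True) call mentioned above then keeps reduce_scatter on SUM semantics, with the averaging division applied afterwards.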
- Ray resource detection: check for "xpu" custom resource in node info
- Placement group: request {"xpu": num_gpus} for XPU workers
- Worker local_rank: derive from RANK % LOCAL_WORLD_SIZE when Ray doesn't
  recognize XPU as a native accelerator type
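
A sketch of that local_rank fallback, assuming torchrun-style RANK and LOCAL_WORLD_SIZE environment variables are present:

```python
import os

# When Ray doesn't know XPU as a native accelerator type, there is no
# accelerator-specific local rank, so derive it from the global rank.
rank = int(os.environ["RANK"])
local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
local_rank = rank % local_world_size
```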
…and workers

- Register XPU in FSDP EngineRegistry
- XCCL ReduceOp.AVG/MAX workarounds in engine_workers + sft_trainer + seqlen_balancing
- Force eager attention, auto-disable torch.compile on XPU multi-GPU
- torch.xpu.manual_seed() in engine/utils.py
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 0ed5d94 to 2823ab0 on April 7, 2026 at 11:07
@kahlun kahlun changed the base branch from xpu/1-core-device-abstraction to xpu/pr-a1-xccl-workarounds on April 7, 2026 at 11:08
@kahlun kahlun force-pushed the xpu/pr-a1-xccl-workarounds branch from 2e3513a to 201cd50 on April 7, 2026 at 11:09
kahlun added 2 commits April 8, 2026 23:38
The force-override was a safety measure while SDPA was unvalidated.
SDPA is now benchmarked and confirmed 10-22x faster than eager on XPU
(Arc Pro B60, PyTorch 2.11, bf16). The NaN issue only affects explicit
bool masks with all-False rows (left-padded generation), not training
with is_causal=True.

device.py now returns "sdpa" for XPU via get_default_attention_implementation().
Removing the override lets that take effect for both actor and critic.
F.sdpa dispatches to Intel SYCL-TLA Flash kernel on XPU (10-22x faster).
Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
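
A small repro sketch of the distinction the commit draws, with illustrative shapes (falls back to CPU when no XPU is present): is_causal=True never produces a fully masked row, whereas an explicit bool mask with an all-False row makes the softmax see only -inf and return NaN.

```python
import torch
import torch.nn.functional as F

device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
q = torch.randn(1, 2, 4, 8, device=device)
k, v = torch.randn_like(q), torch.randn_like(q)

# Training-style call: causal masking via the flag, no NaN risk.
safe = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Left-padded generation style: a bool mask whose first row is all False.
mask = torch.ones(1, 1, 4, 4, dtype=torch.bool, device=device)
mask[..., 0, :] = False
risky = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

print(safe.isnan().any().item())   # False
print(risky.isnan().any().item())  # True: softmax over an all-masked row
```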
