[1/7][hardware, fsdp, worker] feat: add Intel XPU support for FSDP engine and workers #5
Open

kahlun wants to merge 7 commits into xpu/pr-a1-xccl-workarounds
Conversation
- Add is_torch_xpu_available() and the is_xpu_available flag
- Extend get_device_name() to return "xpu" when XPU is available
- Extend get_nccl_backend() to return "xccl" for XPU
- Extend get_resource_name() to return "xpu" for Ray resources
- Add get_default_attention_implementation() → "eager" for XPU (the flash_attn package is CUDA-only; XPU uses PyTorch SDPA instead)
- Extend get_torch_device() to return the torch.xpu namespace
- Extend is_support_ipc() to return False for XPU (no SYCL IPC yet)

No behavioral change for existing CUDA/NPU paths.
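A minimal sketch of the device helpers this commit describes, using the function names from the commit message; the surrounding module layout and the non-XPU defaults shown here are assumptions, not the PR's exact device.py code.

```python
# Sketch of the XPU-aware device helpers described above (illustrative only).
import torch


def is_torch_xpu_available() -> bool:
    # torch.xpu is present in recent PyTorch builds with Intel GPU support.
    return hasattr(torch, "xpu") and torch.xpu.is_available()


is_xpu_available = is_torch_xpu_available()


def get_device_name() -> str:
    # Prefer CUDA, then XPU, then CPU (assumed ordering).
    if torch.cuda.is_available():
        return "cuda"
    if is_xpu_available:
        return "xpu"
    return "cpu"


def get_nccl_backend() -> str:
    # XCCL is the oneCCL-backed process-group backend for XPU tensors.
    return "xccl" if get_device_name() == "xpu" else "nccl"


def get_default_attention_implementation() -> str:
    # flash_attn wheels are CUDA-only, so XPU falls back to "eager" in this
    # commit; a later commit in this PR switches the XPU default to "sdpa".
    # The "flash_attention_2" default for CUDA is an assumption.
    return "eager" if get_device_name() == "xpu" else "flash_attention_2"
```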
- Composite distributed backend "cpu:gloo,xpu:xccl" (both CPU and XPU tensors need to be supported in the same process group)
- all_reduce_avg() workaround: XCCL lacks ReduceOp.AVG, so use SUM followed by a divide
- FSDP2: set_force_sum_reduction_for_comms(True) for reduce_scatter

These workarounds are temporary; they will be removed when oneCCL adds native AVG support.
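A sketch of the SUM-plus-divide averaging workaround named above. The function name all_reduce_avg comes from the commit message; the backend check and wrapper structure are illustrative assumptions.

```python
# Sketch of the ReduceOp.AVG workaround on XCCL (illustrative, not PR code).
import torch
import torch.distributed as dist


def all_reduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Average a tensor across ranks without relying on ReduceOp.AVG."""
    backend = str(dist.get_backend(group))
    if "xccl" in backend:
        # XCCL does not implement ReduceOp.AVG yet: SUM, then divide locally.
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
        tensor.div_(dist.get_world_size(group))
    else:
        dist.all_reduce(tensor, op=dist.ReduceOp.AVG, group=group)
    return tensor
```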
- Ray resource detection: check for the "xpu" custom resource in node info
- Placement group: request {"xpu": num_gpus} for XPU workers
- Worker local_rank: derive it from RANK % LOCAL_WORLD_SIZE when Ray doesn't recognize XPU as a native accelerator type (see the sketch below)
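A sketch of the placement-group request and local_rank fallback described in this list. The environment variable names RANK and LOCAL_WORLD_SIZE follow torchrun conventions; the helper names and bundle shape are illustrative assumptions.

```python
# Sketch of XPU custom-resource scheduling in Ray (illustrative only).
import os
from ray.util.placement_group import placement_group


def build_xpu_placement_group(num_nodes: int, num_gpus: int):
    # Ray has no native XPU accelerator type, so XPU devices are advertised
    # as a custom resource named "xpu" and requested per bundle.
    bundles = [{"CPU": num_gpus, "xpu": num_gpus} for _ in range(num_nodes)]
    return placement_group(bundles, strategy="PACK")


def get_local_rank() -> int:
    # Ray cannot assign XPU device indices itself, so fall back to the
    # global rank modulo the per-node world size.
    rank = int(os.environ["RANK"])
    local_world_size = int(os.environ["LOCAL_WORLD_SIZE"])
    return rank % local_world_size
```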
…and workers

- Register XPU in the FSDP EngineRegistry
- XCCL ReduceOp.AVG/MAX workarounds in engine_workers, sft_trainer, and seqlen_balancing
- Force eager attention and auto-disable torch.compile on multi-GPU XPU
- torch.xpu.manual_seed() in engine/utils.py
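A sketch of the device-aware seeding this commit adds to engine/utils.py; the helper name and the non-XPU branches are assumptions, only the torch.xpu.manual_seed() call comes from the commit message.

```python
# Sketch of XPU-aware RNG seeding (illustrative, not the PR's exact code).
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    elif hasattr(torch, "xpu") and torch.xpu.is_available():
        # New in this PR: seed the XPU generator as well.
        torch.xpu.manual_seed(seed)
```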
The force-override was a safety measure while SDPA was unvalidated on XPU. SDPA has now been benchmarked and confirmed 10-22x faster than eager on XPU (Arc Pro B60, PyTorch 2.11, bf16). The NaN issue only affects explicit boolean masks with all-False rows (left-padded generation), not training with is_causal=True. device.py now returns "sdpa" for XPU via get_default_attention_implementation(), and removing the override lets that default take effect for both the actor and the critic.
F.scaled_dot_product_attention dispatches to Intel's SYCL-TLA flash-attention kernel on XPU (10-22x faster than eager). Benchmarked on Arc Pro B60, PyTorch 2.11, bf16.
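A small illustration of the mask caveat mentioned above: an explicit boolean mask with an all-False row (as left-padded generation produces) forces SDPA to softmax over all -inf and yields NaN, while the causal training path is unaffected. This is a generic PyTorch repro, not code from the PR.

```python
# Repro of the SDPA NaN caveat with all-False mask rows (runs on CPU too).
import torch
import torch.nn.functional as F

q = k = v = torch.randn(1, 1, 4, 8)

# Causal path used during training: finite outputs.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out_causal.isnan().any())  # tensor(False)

# Boolean mask with an all-False row: every key is masked out for that query,
# so the softmax normalizer is zero and the whole row becomes NaN.
mask = torch.zeros(1, 1, 4, 4, dtype=torch.bool)
mask[..., 1:, :] = True  # row 0 attends to nothing
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out_masked[..., 0, :].isnan().any())  # tensor(True)
```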
Summary
- Register XPU in the FSDP EngineRegistry(device=["cuda", "npu", "xpu"])
- ReduceOp.AVG workaround (SUM + divide) in engine_workers.py and sft_trainer.py
- ReduceOp.MAX CPU routing workaround in seqlen_balancing.py
- fsdp_workers.py: auto-disable torch.compile on multi-GPU XPU
- torch.xpu.manual_seed() in engine/utils.py
- config/model.py

Depends on: PR A (core device abstraction, xpu/1-core-device-abstraction). Includes the PR A foundation (device.py, distributed.py, fsdp_utils.py, ray/base.py, worker.py) for self-contained review.
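A hedged sketch of the ReduceOp.MAX "CPU routing" workaround listed above. The seqlen_balancing.py change itself is not visible here; the mechanism assumed is to run the MAX reduction on a CPU copy (served by the gloo half of the composite "cpu:gloo,xpu:xccl" backend) and copy the result back to the XPU tensor.

```python
# Assumed shape of the ReduceOp.MAX CPU-routing workaround (illustrative).
import torch
import torch.distributed as dist


def all_reduce_max(tensor: torch.Tensor, group=None) -> torch.Tensor:
    if tensor.device.type == "xpu":
        # Route the MAX reduction through CPU/gloo, then restore the device.
        cpu_tensor = tensor.cpu()
        dist.all_reduce(cpu_tensor, op=dist.ReduceOp.MAX, group=group)
        tensor.copy_(cpu_tensor)
    else:
        dist.all_reduce(tensor, op=dist.ReduceOp.MAX, group=group)
    return tensor
```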
Hardware tested
Test results
Test plan