[4/7][fsdp] feat: implement Pipeline Parallel for TorchTitan engine #10
Open
kahlun wants to merge 5 commits into xpu/pr-b-fsdp-workers from
Conversation
Add core infrastructure for Intel XPU (Arc/Flex GPU) support:
- XPU device detection and resource management (device.py)
  * is_xpu_available flag and detection
  * get_default_attention_implementation() - returns 'eager' for XPU
  * XPU backend support (xccl for distributed)
  * ONEAPI_DEVICE_SELECTOR env var handling
- Distributed backend workarounds (distributed.py)
  * all_reduce_avg() wrapper for oneCCL limitation
  * oneCCL doesn't support ReduceOp.AVG, so uses SUM + divide
  * Composite backend support: 'cpu:gloo,xpu:xccl'
- FSDP2 compatibility (fsdp_utils.py)
  * Auto-apply set_force_sum_reduction_for_comms(True) for XPU
  * Workaround for reduce_scatter AVG limitation
- Ray integration (single_controller)
  * XPU resource detection for Ray workers
  * Backend selection for distributed initialization

This provides the foundation for running verl training on Intel GPUs with the PyTorch 2.10+ XPU backend.
Tested: PyTorch 2.11.0+xpu on Intel Arc Pro B60
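The SUM-then-divide workaround described above is small enough to sketch. Below is a minimal illustration of what an all_reduce_avg() wrapper in distributed.py could look like; it is a sketch against the public torch.distributed API, not the exact verl code.

```python
import torch
import torch.distributed as dist


def all_reduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Average a tensor across ranks without using ReduceOp.AVG.

    oneCCL (xccl) does not implement ReduceOp.AVG, so all-reduce with SUM
    and divide by the group size afterwards. Assumes a floating-point tensor.
    """
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    tensor /= dist.get_world_size(group=group)
    return tensor
```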
…and workers
- Register XPU in the FSDP EngineRegistry (device=["cuda", "npu", "xpu"])
- XCCL ReduceOp.AVG workaround (SUM + divide) in engine_workers + sft_trainer
- XCCL ReduceOp.MAX CPU routing in seqlen_balancing
- Force eager attention on XPU; auto-disable torch.compile on multi-GPU
- torch.xpu.manual_seed() in engine/utils.py
- XPU default attention implementation in config/model.py
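For the ReduceOp.MAX CPU routing mentioned above, a hedged sketch follows. The helper name and the exact call site in seqlen_balancing are assumptions; the idea is that with a composite process group ("cpu:gloo,xpu:xccl") a CPU tensor is reduced by gloo, sidestepping the oneCCL limitation.

```python
import torch
import torch.distributed as dist


def all_reduce_max_via_cpu(value: torch.Tensor, group=None) -> torch.Tensor:
    """MAX-reduce on XPU by detouring through the CPU (gloo) backend.

    Copies the tensor to CPU, lets gloo handle ReduceOp.MAX, then moves the
    result back to the original device.
    """
    cpu_value = value.detach().cpu()
    dist.all_reduce(cpu_value, op=dist.ReduceOp.MAX, group=group)
    return cpu_value.to(value.device)
```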
- Implement _pp_forward_backward_batch() for PP training + inference
- Add _make_pp_dummy_output() for non-last PP stage ranks
- Route forward_backward_batch() to the PP path when pp_enabled
- Add TP sequence-length padding (pad to seq_len_divisor before forward, strip after); fixes torchtitan non-divisible seq_len (verl-project#1306)
- Uniform-length padding across PP micro-batches for schedule compatibility
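The TP sequence-length padding can be illustrated with a small helper. pad_to_divisor is a hypothetical name used here for illustration; the real engine code may also need to pad attention masks and position ids.

```python
import torch
import torch.nn.functional as F


def pad_to_divisor(input_ids: torch.Tensor, seq_len_divisor: int, pad_token_id: int = 0):
    """Right-pad the sequence dimension to a multiple of seq_len_divisor.

    Returns the padded tensor and the number of pad positions so the caller
    can strip them from the model output after the forward pass.
    """
    seq_len = input_ids.size(-1)
    pad_len = (-seq_len) % seq_len_divisor
    if pad_len:
        input_ids = F.pad(input_ids, (0, pad_len), value=pad_token_id)
    return input_ids, pad_len
```

A caller would then slice the trailing pad_len positions off the logits before computing the loss.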
- device.py: return "sdpa" for XPU (SYCL-TLA Flash, 10-22x faster than eager)
- fsdp_workers.py: remove the force-override to "eager" (safe with is_causal=True)
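A sketch of how the attention-implementation default might look after this commit. The XPU branch reflects the change above; the CUDA branch is illustrative only and not part of this PR.

```python
import torch


def get_default_attention_implementation() -> str:
    """Pick the default attention backend for the current accelerator.

    On XPU, "sdpa" dispatches to the SYCL-TLA Flash kernels, which the PR
    reports as roughly 10-22x faster than the eager path.
    """
    if torch.cuda.is_available():
        return "sdpa"  # illustrative default for CUDA; not defined by this PR
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "sdpa"
    return "eager"
```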
Summary
Implement Pipeline Parallel (PP) for the TorchTitan engine, which previously raised NotImplementedError. This is not XPU-specific; it benefits CUDA, NPU, and XPU equally.
Changes
- _pp_forward_backward_batch(): per-microbatch PP schedule execution (training + inference)
- _make_pp_dummy_output(): zero-filled output for non-last PP stage ranks
- forward_backward_batch() router: detect pp_enabled → dispatch to the PP path
- Pad input_ids to the nearest seq_len_divisor before the forward pass and strip the padding after (fixes torchtitan non-divisible seq_len, verl-project/verl#1306)
PP Schedules Supported
All 7 TorchTitan schedules work: GPipe, 1F1B, Interleaved1F1B, InterleavedZeroBubble, ZBVZeroBubble, DualPipeV, LoopedBFS.
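For reference, a simplified sketch of the router and the dummy-output helper listed under Changes. The class name, method signatures, and the dummy-output shape are assumptions for illustration, not the engine's actual API.

```python
import torch


class TorchTitanEngine:  # illustrative stub, not verl's actual class
    def forward_backward_batch(self, batch, loss_fn=None, forward_only=False):
        """Route to the Pipeline Parallel path when PP is enabled."""
        if self.pp_enabled:
            return self._pp_forward_backward_batch(batch, loss_fn, forward_only)
        # hypothetical name for the existing non-PP path
        return self._fsdp_forward_backward_batch(batch, loss_fn, forward_only)

    def _make_pp_dummy_output(self, micro_batch):
        """Zero-filled stand-in for ranks that do not own the last PP stage.

        Only the last stage produces real outputs; earlier stages return a
        zero tensor of a matching shape so downstream code sees a uniform API.
        """
        batch_size, seq_len = micro_batch["input_ids"].shape
        return torch.zeros(batch_size, seq_len, device=micro_batch["input_ids"].device)
```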
Depends on: #5 (FSDP engine)
Test plan