
[4/7][fsdp] feat: implement Pipeline Parallel for TorchTitan engine #10

Open
kahlun wants to merge 5 commits into xpu/pr-b-fsdp-workers from xpu/pr-e1-torchtitan-pp

Conversation


@kahlun kahlun commented Apr 7, 2026

Summary

Implement Pipeline Parallel (PP) for the TorchTitan engine; this path previously raised NotImplementedError.

This is NOT XPU-specific — it benefits CUDA, NPU, and XPU equally.

Changes

  • _pp_forward_backward_batch(): Per-microbatch PP schedule execution (training + inference)
  • _make_pp_dummy_output(): Zero-filled output for non-last PP stage ranks
  • forward_backward_batch() router: Detect pp_enabled → dispatch to PP path
  • TP sequence-length padding: Pad input_ids to the nearest multiple of seq_len_divisor before forward, strip after (fixes verl-project/verl#1306 for non-divisible sequence lengths)
  • Uniform-length padding across PP micro-batches for schedule compatibility
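The TP padding step could be sketched roughly like this (helper names `pad_to_divisor` / `strip_padding` are hypothetical, not the PR's actual identifiers; `seq_len_divisor` would come from the TP configuration):

```python
import torch
import torch.nn.functional as F

def pad_to_divisor(input_ids: torch.Tensor, divisor: int, pad_id: int = 0):
    """Right-pad the sequence dim to the nearest multiple of `divisor`.

    Returns the padded ids plus the original length, so the caller can
    strip the padded positions from the logits after the forward pass.
    """
    orig_len = input_ids.size(-1)
    pad = (-orig_len) % divisor  # 0 when already divisible
    if pad:
        input_ids = F.pad(input_ids, (0, pad), value=pad_id)
    return input_ids, orig_len

def strip_padding(logits: torch.Tensor, orig_len: int) -> torch.Tensor:
    """Drop logits produced for padded positions (sequence dim is dim 1)."""
    return logits[:, :orig_len]
```

A sequence of length 10 with seq_len_divisor=8 gets padded to 16 for the forward pass, and the trailing 6 logit positions are discarded afterwards.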

PP Schedules Supported

All 7 TorchTitan schedules work: GPipe, 1F1B, Interleaved1F1B, InterleavedZeroBubble, ZBVZeroBubble, DualPipeV, LoopedBFS.
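In a PP run only the last stage produces real logits; the dummy-output helper listed above presumably keeps every rank's return value uniformly shaped. A minimal sketch under that assumption (name and shape layout are assumptions, not the PR's exact code):

```python
import torch

def make_pp_dummy_output(batch_size: int, seq_len: int, vocab_size: int,
                         device: str = "cpu",
                         dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Zero-filled logits stand-in for ranks that are not the last PP stage.

    Earlier pipeline stages never compute logits; returning a placeholder
    with the same (batch, seq, vocab) shape lets downstream code treat
    every rank's output uniformly regardless of its stage.
    """
    return torch.zeros(batch_size, seq_len, vocab_size,
                       device=device, dtype=dtype)
```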

Depends on: #5 (FSDP engine)

Test plan

  • Existing CUDA TorchTitan tests unaffected
  • PP=2 forward-only returns logits on last stage
  • PP=2 training: loss decreasing with correct gradient computation

kahlun added 2 commits April 7, 2026 00:38
Add core infrastructure for Intel XPU (Arc/Flex GPU) support:

- XPU device detection and resource management (device.py)
  * is_xpu_available flag and detection
  * get_default_attention_implementation() - returns 'eager' for XPU
  * XPU backend support (xccl for distributed)
  * ONEAPI_DEVICE_SELECTOR env var handling

- Distributed backend workarounds (distributed.py)
  * all_reduce_avg() wrapper for oneCCL limitation
  * oneCCL doesn't support ReduceOp.AVG, uses SUM+divide
  * Composite backend support: 'cpu:gloo,xpu:xccl'

- FSDP2 compatibility (fsdp_utils.py)
  * Auto-apply set_force_sum_reduction_for_comms(True) for XPU
  * Workaround for reduce_scatter AVG limitation

- Ray integration (single_controller)
  * XPU resource detection for Ray workers
  * Backend selection for distributed initialization

This provides the foundation for running verl training on Intel GPUs
with PyTorch 2.10+ XPU backend.

Tested: PyTorch 2.11.0+xpu on Intel Arc Pro B60
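The oneCCL AVG limitation mentioned above amounts to a small wrapper: all-reduce with SUM, then divide by the world size. A sketch (the function name follows the commit message; the exact signature is an assumption):

```python
import torch
import torch.distributed as dist

def all_reduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Emulate ReduceOp.AVG for backends (e.g. oneCCL/xccl) that lack it.

    All-reduce with SUM across the group, then divide in place by the
    group's world size, which is mathematically equivalent to AVG.
    """
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    tensor.div_(dist.get_world_size(group=group))
    return tensor
```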
kahlun added 2 commits April 7, 2026 02:02
…and workers

- Register XPU in FSDP EngineRegistry (device=["cuda", "npu", "xpu"])
- XCCL ReduceOp.AVG workaround (SUM + divide) in engine_workers + sft_trainer
- XCCL ReduceOp.MAX CPU routing in seqlen_balancing
- Force eager attention on XPU, auto-disable torch.compile on multi-GPU
- torch.xpu.manual_seed() in engine/utils.py
- XPU default attention implementation in config/model.py
- Implement _pp_forward_backward_batch() for PP training + inference
- Add _make_pp_dummy_output() for non-last PP stage ranks
- Route forward_backward_batch() to PP path when pp_enabled
- Add TP sequence-length padding (pad to seq_len_divisor before forward,
  strip after) — fixes torchtitan verl-project#1306 for non-divisible seq_len
- Uniform-length padding across PP micro-batches for schedule compatibility
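The uniform-length padding across micro-batches could look roughly like this (hypothetical helper; pad token and tensor layout are assumptions). PP schedules generally require every micro-batch to have the same shape, so all micro-batches are padded to the longest one:

```python
import torch
import torch.nn.functional as F

def pad_microbatches_uniform(microbatches, pad_id: int = 0):
    """Right-pad every micro-batch to the longest sequence length present.

    The PP schedule then sees identically shaped inputs on every step,
    which the fixed-shape pipeline stages require.
    """
    max_len = max(mb.size(-1) for mb in microbatches)
    return [F.pad(mb, (0, max_len - mb.size(-1)), value=pad_id)
            for mb in microbatches]
```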
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1ce3b64 to 0ed5d94 on April 7, 2026 09:14
@kahlun kahlun force-pushed the xpu/pr-e1-torchtitan-pp branch from b325a25 to 58acff1 on April 7, 2026 09:15
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 0ed5d94 to 2823ab0 on April 7, 2026 11:07
device.py: return "sdpa" for XPU (SYCL-TLA Flash, 10-22x faster than eager)
fsdp_workers.py: remove force-override to "eager" (safe with is_causal=True)
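The device.py change described above switches XPU's default attention backend from "eager" to "sdpa". A hedged sketch of such a per-device dispatch (the signature and the non-XPU entries in the mapping are purely illustrative):

```python
def get_default_attention_implementation(device_type: str) -> str:
    """Pick a default attention backend per device type.

    XPU defaults to "sdpa" (the SYCL-TLA Flash path, reported as 10-22x
    faster than eager in this PR); other entries are illustrative
    placeholders, not verified project defaults.
    """
    defaults = {"xpu": "sdpa", "cuda": "sdpa", "npu": "sdpa"}
    return defaults.get(device_type, "eager")
```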