
[4/7][fsdp] feat: implement Pipeline Parallel for TorchTitan engine #10

Open
kahlun wants to merge 5 commits into xpu/pr-b-fsdp-workers from xpu/pr-e1-torchtitan-pp

Conversation


@kahlun kahlun commented Apr 7, 2026

Summary

Implement Pipeline Parallel (PP) for the TorchTitan engine; this path previously raised NotImplementedError.

This is NOT XPU-specific — it benefits CUDA, NPU, and XPU equally.

Changes

  • _pp_forward_backward_batch(): Per-microbatch PP schedule execution (training + inference)
  • _make_pp_dummy_output(): Zero-filled output for non-last PP stage ranks
  • forward_backward_batch() router: Detect pp_enabled → dispatch to PP path
  • TP sequence-length padding: Pad input_ids to the nearest multiple of seq_len_divisor before forward, strip after (fixes verl-project/verl#1306 for non-divisible sequence lengths)
  • Uniform-length padding across PP micro-batches for schedule compatibility
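The TP padding step could be sketched roughly like this (helper names `pad_to_divisor` / `strip_padding` are hypothetical, not the PR's actual identifiers; `seq_len_divisor` would come from the TP configuration):

```python
import torch
import torch.nn.functional as F

def pad_to_divisor(input_ids: torch.Tensor, divisor: int, pad_id: int = 0):
    """Right-pad the sequence dim to the nearest multiple of `divisor`.

    Returns the padded ids plus the original length, so the caller can
    strip the padded positions from the logits after the forward pass.
    """
    orig_len = input_ids.size(-1)
    pad = (-orig_len) % divisor  # 0 when already divisible
    if pad:
        input_ids = F.pad(input_ids, (0, pad), value=pad_id)
    return input_ids, orig_len

def strip_padding(logits: torch.Tensor, orig_len: int) -> torch.Tensor:
    """Drop logits produced for padded positions (sequence dim is dim 1)."""
    return logits[:, :orig_len]
```

A sequence of length 10 with seq_len_divisor=8 gets padded to 16 for the forward pass, and the trailing 6 logit positions are discarded afterwards.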

PP Schedules Supported

All 7 TorchTitan schedules work: GPipe, 1F1B, Interleaved1F1B, InterleavedZeroBubble, ZBVZeroBubble, DualPipeV, LoopedBFS.
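In a PP run only the last stage produces real logits; the dummy-output helper listed above presumably keeps every rank's return value uniformly shaped. A minimal sketch under that assumption (name and shape layout are assumptions, not the PR's exact code):

```python
import torch

def make_pp_dummy_output(batch_size: int, seq_len: int, vocab_size: int,
                         device: str = "cpu",
                         dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Zero-filled logits stand-in for ranks that are not the last PP stage.

    Earlier pipeline stages never compute logits; returning a placeholder
    with the same (batch, seq, vocab) shape lets downstream code treat
    every rank's output uniformly regardless of its stage.
    """
    return torch.zeros(batch_size, seq_len, vocab_size,
                       device=device, dtype=dtype)
```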

Depends on: #5 (FSDP engine)

Test plan

  • Existing CUDA TorchTitan tests unaffected
  • PP=2 forward-only returns logits on last stage
  • PP=2 training: loss decreasing with correct gradient computation

kahlun added 2 commits April 7, 2026 00:38
Add core infrastructure for Intel XPU (Arc/Flex GPU) support:

- XPU device detection and resource management (device.py)
  * is_xpu_available flag and detection
  * get_default_attention_implementation() - returns 'eager' for XPU
  * XPU backend support (xccl for distributed)
  * ONEAPI_DEVICE_SELECTOR env var handling

- Distributed backend workarounds (distributed.py)
  * all_reduce_avg() wrapper for oneCCL limitation
  * oneCCL doesn't support ReduceOp.AVG, uses SUM+divide
  * Composite backend support: 'cpu:gloo,xpu:xccl'

- FSDP2 compatibility (fsdp_utils.py)
  * Auto-apply set_force_sum_reduction_for_comms(True) for XPU
  * Workaround for reduce_scatter AVG limitation

- Ray integration (single_controller)
  * XPU resource detection for Ray workers
  * Backend selection for distributed initialization

This provides the foundation for running verl training on Intel GPUs
with PyTorch 2.10+ XPU backend.

Tested: PyTorch 2.11.0+xpu on Intel Arc Pro B60
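The oneCCL AVG limitation mentioned above amounts to a small wrapper: all-reduce with SUM, then divide by the world size. A sketch (the function name follows the commit message; the exact signature is an assumption):

```python
import torch
import torch.distributed as dist

def all_reduce_avg(tensor: torch.Tensor, group=None) -> torch.Tensor:
    """Emulate ReduceOp.AVG for backends (e.g. oneCCL/xccl) that lack it.

    All-reduce with SUM across the group, then divide in place by the
    group's world size, which is mathematically equivalent to AVG.
    """
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=group)
    tensor.div_(dist.get_world_size(group=group))
    return tensor
```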
kahlun added 2 commits April 7, 2026 02:02
…and workers

- Register XPU in FSDP EngineRegistry (device=["cuda", "npu", "xpu"])
- XCCL ReduceOp.AVG workaround (SUM + divide) in engine_workers + sft_trainer
- XCCL ReduceOp.MAX CPU routing in seqlen_balancing
- Force eager attention on XPU, auto-disable torch.compile on multi-GPU
- torch.xpu.manual_seed() in engine/utils.py
- XPU default attention implementation in config/model.py
- Implement _pp_forward_backward_batch() for PP training + inference
- Add _make_pp_dummy_output() for non-last PP stage ranks
- Route forward_backward_batch() to PP path when pp_enabled
- Add TP sequence-length padding (pad to seq_len_divisor before forward,
  strip after) — fixes torchtitan verl-project#1306 for non-divisible seq_len
- Uniform-length padding across PP micro-batches for schedule compatibility
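The uniform-length padding across micro-batches could look roughly like this (hypothetical helper; pad token and tensor layout are assumptions). PP schedules generally require every micro-batch to have the same shape, so all micro-batches are padded to the longest one:

```python
import torch
import torch.nn.functional as F

def pad_microbatches_uniform(microbatches, pad_id: int = 0):
    """Right-pad every micro-batch to the longest sequence length present.

    The PP schedule then sees identically shaped inputs on every step,
    which the fixed-shape pipeline stages require.
    """
    max_len = max(mb.size(-1) for mb in microbatches)
    return [F.pad(mb, (0, max_len - mb.size(-1)), value=pad_id)
            for mb in microbatches]
```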
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 1ce3b64 to 0ed5d94 on April 7, 2026 09:14
@kahlun kahlun force-pushed the xpu/pr-e1-torchtitan-pp branch from b325a25 to 58acff1 on April 7, 2026 09:15
@kahlun kahlun force-pushed the xpu/pr-b-fsdp-workers branch from 0ed5d94 to 2823ab0 on April 7, 2026 11:07
device.py: return "sdpa" for XPU (SYCL-TLA Flash, 10-22x faster than eager)
fsdp_workers.py: remove force-override to "eager" (safe with is_causal=True)
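The device.py change described above switches XPU's default attention backend from "eager" to "sdpa". A hedged sketch of such a per-device dispatch (the signature and the non-XPU entries in the mapping are purely illustrative):

```python
def get_default_attention_implementation(device_type: str) -> str:
    """Pick a default attention backend per device type.

    XPU defaults to "sdpa" (the SYCL-TLA Flash path, reported as 10-22x
    faster than eager in this PR); other entries are illustrative
    placeholders, not verified project defaults.
    """
    defaults = {"xpu": "sdpa", "cuda": "sdpa", "npu": "sdpa"}
    return defaults.get(device_type, "eager")
```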