[flex_shard] Plan: fp8 all-gather on GroupedRaggedShard (block-wise) by weifengpy · Pull Request #3537 · pytorch/torchtitan

weifengpy · 2026-06-04T21:52:21Z

Stack from ghstack (oldest at bottom):

Design doc for fp8 all-gather on GroupedRaggedShard: align the byte-balanced
shard cut to 128x128 block-wise-quantization tiles so each rank owns whole tiles,
making per-rank quantization local and the gathered fp8 weight bit-identical to
gather-then-quantize. Includes the scale-precompute / horizontal-amax-fusion
design ported from the Float8 All-Gather in FSDP/TP work, and the bf16-master /
fp8-compute split (comm optimization, not at-rest memory).
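
The tile-alignment argument above can be illustrated with a small pure-Python sketch. It is not the PR's implementation: the helper names `tile_aligned_row_cuts` and `tile_amax` are hypothetical, the cut balances whole tile-rows rather than bytes, and a 4x4 tile stands in for the 128x128 tile to keep the demo small. It only shows that when every shard boundary falls on a tile boundary, the per-tile amax (from which a block-wise scale would be derived) computed locally on each shard matches the one computed on the full weight.

```python
import random

BLOCK = 4  # demo tile edge; the PR description uses 128x128 tiles


def tile_aligned_row_cuts(num_rows, world_size, block=BLOCK):
    """Hypothetical helper: split rows into contiguous per-rank ranges
    whose boundaries are multiples of `block` (except the final end)."""
    num_tile_rows = -(-num_rows // block)  # ceil division
    base, extra = divmod(num_tile_rows, world_size)
    cuts, start = [], 0
    for rank in range(world_size):
        tiles = base + (1 if rank < extra else 0)
        end = min(start + tiles * block, num_rows)
        cuts.append((start, end))
        start = end
    return cuts


def tile_amax(weight, block=BLOCK):
    """Per-tile absolute maximum of a list-of-rows matrix.
    Assumes row 0 of `weight` starts a tile."""
    if not weight:
        return []
    rows, cols = len(weight), len(weight[0])
    out = []
    for r0 in range(0, rows, block):
        out.append([
            max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            for c0 in range(0, cols, block)
        ])
    return out


rng = random.Random(0)
num_rows, num_cols, world_size = 22, 8, 3  # ragged: 22 is not a multiple of 4
weight = [[rng.uniform(-1.0, 1.0) for _ in range(num_cols)] for _ in range(num_rows)]

cuts = tile_aligned_row_cuts(num_rows, world_size)

# Each rank computes amax on its own shard only (no communication).
gathered_amax = []
for start, end in cuts:
    gathered_amax.extend(tile_amax(weight[start:end]))

# Reference: amax computed on the full, unsharded weight.
full_amax = tile_amax(weight)

print(cuts)
print(gathered_amax == full_amax)
```

Because quantization of an element depends only on the element and its tile's scale, equal per-tile amax values are what allow quantize-then-gather to agree with gather-then-quantize in this sketch. A cut that splits a tile between two ranks would break the equality, since neither rank would see the whole tile.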


Update

a4b4287

pytorch-bot Bot added the ciflow/8gpu label Jun 4, 2026

meta-cla Bot added the CLA Signed label Jun 4, 2026

This was referenced Jun 3, 2026

Introduce FlexShard for flexible bucketed parameter sharding #3239 (Draft)

Make FlexShard traceable by torch.compile #3317 (Draft)

Add grouped RaggedShard bucket layout #3407 (Closed)

Add communication-free Muon for FlexShard #3502 (Draft)

Update

7079f60

Update

ab3dd29

Update

05fef10

Update

6613e0d

Update

7990c8c

weifengpy added 2 commits June 9, 2026 15:22

Update

c4ecf29

Update

029afbf

weifengpy mentioned this pull request Jun 10, 2026

[flex_shard] Add DeepSeek V3 eager training entry point #3603 (Draft)

weifengpy wants to merge 8 commits into gh/weifengpy/31/base from gh/weifengpy/31/head