[flex_shard] Plan: fp8 all-gather on GroupedRaggedShard (block-wise) #3537

Draft
weifengpy wants to merge 8 commits into gh/weifengpy/31/base from gh/weifengpy/31/head

weifengpy (Contributor) commented Jun 4, 2026

Stack from ghstack (oldest at bottom):

Design doc for fp8 all-gather on GroupedRaggedShard. The plan aligns the
byte-balanced shard cut to the 128x128 tiles used for block-wise quantization,
so each rank owns whole tiles. Per-rank quantization is then local, and the
gathered fp8 weight is bit-identical to the result of gather-then-quantize.
The doc also includes the scale-precompute / horizontal-amax-fusion design
ported from the Float8 All-Gather in FSDP/TP work, and the bf16-master /
fp8-compute split (a communication optimization, not an at-rest memory saving).
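
The bit-identical claim can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration and not the PR's implementation: the helper names `tile_aligned_cuts` and `quantize_blockwise` are hypothetical, the tile size is shrunk from 128 to 4, a single 2-D weight is cut along rows only (the real GroupedRaggedShard cut is byte-balanced across grouped parameters), integer rounding stands in for the fp8 cast, and list concatenation stands in for the all-gather.

```python
import random

BLOCK = 4        # stand-in for 128 so the example stays small
FP8_MAX = 448.0  # largest finite value of float8_e4m3fn


def tile_aligned_cuts(n_rows, world_size, block=BLOCK):
    """Row boundaries for world_size contiguous shards, each a whole number of tiles."""
    n_tiles = n_rows // block
    return [(n_tiles * rank // world_size) * block for rank in range(world_size + 1)]


def quantize_blockwise(rows, block=BLOCK):
    """Quantize tile by tile with one scale per block x block tile.

    Integer rounding stands in for the real fp8 cast (an assumption of this sketch).
    """
    if not rows:
        return [], []
    n_rows, n_cols = len(rows), len(rows[0])
    assert n_rows % block == 0 and n_cols % block == 0, "shape must be tile-aligned"
    quantized = [[0] * n_cols for _ in range(n_rows)]
    scales = []
    for r0 in range(0, n_rows, block):
        scale_row = []
        for c0 in range(0, n_cols, block):
            cells = [(r, c) for r in range(r0, r0 + block) for c in range(c0, c0 + block)]
            amax = max(abs(rows[r][c]) for r, c in cells)
            scale = amax / FP8_MAX if amax > 0 else 1.0
            for r, c in cells:
                quantized[r][c] = round(rows[r][c] / scale)
            scale_row.append(scale)
        scales.append(scale_row)
    return quantized, scales


random.seed(0)
weight = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(12)]

cuts = tile_aligned_cuts(n_rows=12, world_size=2)  # [0, 4, 12]

# Path A: each rank quantizes its own rows, then the pieces are concatenated
# (the concatenation stands in for the all-gather).
gathered_q, gathered_s = [], []
for lo, hi in zip(cuts, cuts[1:]):
    q, s = quantize_blockwise(weight[lo:hi])
    gathered_q += q
    gathered_s += s

# Path B: quantize the full weight in one place.
ref_q, ref_s = quantize_blockwise(weight)

identical = gathered_q == ref_q and gathered_s == ref_s
print(cuts, identical)
```

Because every cut lands on a tile edge, each tile's amax and scale are computed from the same values on either path, so the two results match exactly. A cut at row 6 would instead split the tile covering rows 4 to 7, and each rank would see only part of that tile's amax.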

weifengpy added a commit that referenced this pull request Jun 4, 2026
ghstack-source-id: b21d5c4
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 4, 2026
ghstack-source-id: 6731673
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 8, 2026
ghstack-source-id: 6731673
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 8, 2026
ghstack-source-id: 391c974
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 8, 2026
ghstack-source-id: be2d9ea
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 9, 2026
ghstack-source-id: be2d9ea
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 9, 2026
ghstack-source-id: be2d9ea
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 9, 2026
ghstack-source-id: 5a171d5
Pull-Request: #3537
weifengpy added a commit that referenced this pull request Jun 9, 2026
ghstack-source-id: 5a171d5
Pull-Request: #3537
weifengpy added 2 commits June 9, 2026 15:22

Labels

ciflow/8gpu, CLA Signed
