[flex_shard] Plan: fp8 all-gather on GroupedRaggedShard (block-wise) #3537
Draft
weifengpy wants to merge 8 commits into
This was referenced Jun 3, 2026
weifengpy added 9 commits that referenced this pull request. Each carries the same design-doc message as the PR description and the trailer "Pull-Request: #3537"; they differ only in date and ghstack-source-id:
Jun 4, 2026: ghstack-source-id: b21d5c4
Jun 4, 2026: ghstack-source-id: 6731673
Jun 8, 2026: ghstack-source-id: 6731673
Jun 8, 2026: ghstack-source-id: 391c974
Jun 8, 2026: ghstack-source-id: be2d9ea
Jun 9, 2026: ghstack-source-id: be2d9ea
Jun 9, 2026: ghstack-source-id: be2d9ea
Jun 9, 2026: ghstack-source-id: 5a171d5
Jun 9, 2026: ghstack-source-id: 5a171d5
Stack from ghstack (oldest at bottom):
Design doc for fp8 all-gather on GroupedRaggedShard: align the byte-balanced
shard cut to the 128x128 tiles used by block-wise quantization so that each rank
owns whole tiles. Per-rank quantization then needs only local data, and the
gathered fp8 weight is bit-identical to gather-then-quantize. The doc also
includes the scale-precompute / horizontal-amax-fusion design ported from the
Float8 All-Gather in FSDP/TP work, and the bf16-master / fp8-compute split
(a communication optimization, not a reduction in at-rest memory).
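
The tile-alignment argument can be illustrated with a small pure-Python sketch. It is not the implementation from this PR: the helper names (`tile_aligned_cuts`, `block_scales`, `allgather_bytes`), sharding by rows of a single 2-D weight, one fp32 scale per tile, and the `amax / 448` scale rule (448 being the largest finite `float8_e4m3fn` value) are all assumptions made for illustration. It shows that when cuts fall on 128-row boundaries, the per-tile scales computed shard by shard match the scales computed on the full weight, whereas an even but unaligned cut splits a tile across two ranks.

```python
import random

TILE = 128            # tile edge for block-wise quantization, per the PR description
FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def tile_aligned_cuts(num_rows, world_size, tile=TILE):
    """Hypothetical helper: world_size + 1 row boundaries that fall on tile edges."""
    num_tiles = -(-num_rows // tile)  # ceil division
    base, extra = divmod(num_tiles, world_size)
    cuts = [0]
    for rank in range(world_size):
        tiles_here = base + (1 if rank < extra else 0)
        # Clamp so a partial trailing tile stays with the rank that owns it.
        cuts.append(min(cuts[-1] + tiles_here * tile, num_rows))
    return cuts


def block_scales(rows, row_offset=0, tile=TILE):
    """Map (tile_row, tile_col) -> amax / FP8_E4M3_MAX for a 2-D list of floats."""
    amax = {}
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            key = ((row_offset + r) // tile, c // tile)
            amax[key] = max(amax.get(key, 0.0), abs(value))
    return {key: a / FP8_E4M3_MAX for key, a in amax.items()}


def tiles_seen_by_several_ranks(weight, cuts, tile=TILE):
    """Tiles whose rows are spread over more than one shard."""
    seen, shared = set(), set()
    for lo, hi in zip(cuts, cuts[1:]):
        local = set(block_scales(weight[lo:hi], row_offset=lo, tile=tile))
        shared |= seen & local
        seen |= local
    return shared


def allgather_bytes(rows, cols, fp8, tile=TILE):
    """Payload size of one gathered weight; assumes one fp32 scale per tile."""
    if not fp8:
        return rows * cols * 2  # bf16: 2 bytes per element
    tiles = -(-rows // tile) * -(-cols // tile)
    return rows * cols + tiles * 4  # fp8: 1 byte per element plus scales


random.seed(0)
num_rows, num_cols, world_size = 640, 256, 4
weight = [[random.uniform(-1, 1) for _ in range(num_cols)] for _ in range(num_rows)]

# Gather-then-quantize view: scales computed on the full weight.
reference = block_scales(weight)

# Quantize-then-gather view with tile-aligned cuts: each rank works locally.
aligned = tile_aligned_cuts(num_rows, world_size)
merged = {}
for lo, hi in zip(aligned, aligned[1:]):
    merged.update(block_scales(weight[lo:hi], row_offset=lo))

# An even split that ignores tile edges, for contrast.
even = [rank * num_rows // world_size for rank in range(world_size + 1)]

print(aligned, merged == reference)
print(even, sorted(tiles_seen_by_several_ranks(weight, even)))
print(allgather_bytes(num_rows, num_cols, fp8=False),
      allgather_bytes(num_rows, num_cols, fp8=True))
```

Because quantization is elementwise once the tile's scale is fixed, identical scales are what make the locally quantized values match the gather-then-quantize result. The byte count at the end reflects the "communication optimization" framing: the gathered payload roughly halves relative to bf16, while the bf16 master copy kept at rest is unaffected.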